21st EURALEX International Congress: Lexicography and Semantics
Words of Welcome:
Dr. Nina Obuljen Koržinek, Minister of Culture and Media
Dr. Željko Jozić, Institute for the Croatian Language, Director
Prof. Dr. Marko Tadić, CLARIN HR Director
Welcome and Opening of Congress:
Dr. Annette Klosa-Kückelhaus, EURALEX President
Dr. Kristina Š. Despot, Chair of the EURALEX 2024 Organising Committee
Practical, commercial lexicography in the United States, in particular, is a field that relies heavily on tradition, and it has been loath to abandon the tried-and-true methods of corpus creation, analysis, and defining that have been established since the time of Murray. Yet frame semantics has provided a broader lens through which the practical lexicographer can view meaning, and its integration (though slow) into the practice of lexicography has yielded defining methods that are more user-oriented while giving the lexicographer tools to move beyond their own unconscious or implicit biases – something that is increasingly important in successful modern lexicography. But technological and social changes in the last several decades – the ease with which mis- and disinformation moves into the mainstream, the rise of generative AI and the regular presentation of generated text as natural language, the proliferation of varieties of English accessible to the lexicographer that are sometimes themselves removed from context, and the changing ways in which online dictionaries are used – have presented difficulties to the practical lexicographer who seeks to integrate frame semantics deeper into their practice. This paper will present case studies on the successes of integrating frame semantics into lexicographical practice, and the current challenges that lexicographers face when the “frame” itself is illusory, shifting, or debated.
The look-up behaviour of dictionary users has an established place in lexicographic research (Bergenholtz & Johnson, 2005; Lemnitzer, 2001; Lorentzen & Theilgaard, 2012; Trap-Jensen et al., 2014). It has been used with some success to improve the quality of the interaction between the dictionary and its users, such as through discovering users’ typical search patterns, their strategies and errors, as well as for fine-tuning the dictionary interface to better serve users. In this study, we leverage user look-up frequency in the English Wiktionary. Research so far has identified a robust relationship between lexical frequency and look-up frequency across various languages and dictionaries (De Schryver et al., 2019; Koplenig et al., 2014; Müller-Spitzer et al., 2015). More recently, other factors have been shown to have an influence on look-up frequency: a word’s age-of-acquisition; its polysemy status (whether the word has one or more senses); prevalence in the speaker population (Lew & Wolfer, 2024); and possibly also part of speech. Central to this research, our preliminary findings suggest that the CEFR level (Council of Europe, 2020, pp. 36–37) explains additional variance in look-up behaviour, beyond what the other lexical factors already account for. Thus, the research so far shows that dictionary look-ups can be predicted, supplying information that is of great practical utility in compiling new dictionaries, as well as in improving existing dictionaries so they can serve their users better. However, the interesting question we want to tackle here is: can we extract additional insights from the look-up data itself that go beyond lexicography? Or, to paraphrase a famous statement: ask not what you can do for your dictionary; ask what your dictionary can do for you!
A known challenge in language learning is the compilation, updating, and expansion of CEFR-graded vocabulary lists: a task that is highly labour-intensive, and has tended to use methodologies that are not always transparent, well-documented, or readily replicable. In this connection, our study proposes using dictionary look-up data, alongside a few other relatively easy-to-obtain lexical properties of words, to predict (or impute) the CEFR level of words, as shown in Figure 1. Our research tries to address the following three fundamental questions: 1. Are classification algorithms able to predict CEFR levels with an accuracy higher than a random baseline (and if so, how much higher)? 2. Which specific algorithm performs best on this task? 3. How much agreement is there among different algorithms in categorizing words into CEFR levels? Our main goal is to develop and test an automated and reliable process for generating lists of candidate words that could then be fruitfully added to existing CEFR lists. When set to predict candidates for three broad levels A, B, C, and using the English Vocabulary Profile (Cambridge University Press, 2015; Capel, 2012, 2015) for training, our best-performing models (Regression Trees, Ordinal Logistic Regression, and Random Forests) returned the following words among candidates for A level: become, true, mom/ma, daddy, and bitch. For level B, we obtained, among others: whip, clay, chamber, commission, and deed. Some of the candidates appear to be worthy of inclusion, though probably not all of them: common swear words, for instance, are generally deemed inappropriate for educational contexts. The proposed method may present a convenient way to update CEFR vocabulary lists, but its feasibility crucially depends on a number of conditions. For the algorithms to be effectively trained, a substantial pre-compiled CEFR list is required as a starting point: our current estimate puts its minimum size at a few thousand items.
The selection of predictors is another crucial aspect, and our study offers an initial set of predictors that may be refined in future research. Finally, the success of our approach depends on the availability of open dictionary data, including metrics such as views, number of senses, and part-of-speech information.
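The first and third research questions above (accuracy relative to a baseline, and agreement among algorithms) can be made concrete with a small sketch. It is not the authors' evaluation code: the gold levels and the two sets of model predictions are invented toy data, the function names are hypothetical, and a majority-class baseline plus Cohen's kappa are assumed here as one plausible way to operationalise "baseline" and "agreement".

```python
from collections import Counter

def accuracy(gold, predicted):
    """Share of words whose predicted level matches the gold level."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent level."""
    return Counter(gold).most_common(1)[0][1] / len(gold)

def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1:
        return 1.0  # both sequences use a single, identical label
    return (observed - expected) / (1 - expected)

# Toy data (assumption): six words with gold levels and two models' output.
gold    = ["A", "A", "A", "B", "B", "C"]
model_1 = ["A", "A", "B", "B", "B", "C"]
model_2 = ["A", "A", "A", "B", "C", "C"]

print("baseline:", majority_baseline(gold))
print("model 1 :", accuracy(gold, model_1))
print("model 2 :", accuracy(gold, model_2))
print("kappa   :", cohen_kappa(model_1, model_2))
```

On real data the same three functions would be applied to held-out words from the training list, so that a gain over the baseline reflects generalisation and not memorisation.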
Portuguese is the official language of nine countries and one territory. However, given the socio-historical contexts of these countries, its functional status varies greatly. In Brazil, Portugal, and São Tomé and Príncipe (Hagemeijer, 2018), Portuguese is the mother tongue for the majority of the population. In Angola and Mozambique, it is the majority vehicular language, typically as a second language (Inverno, 2018; Chimbutane, 2018). In Cape Verde, Guinea-Bissau (Pinto & Melo-Pfeifer, 2018, pp. 1112) and Timor-Leste (Carioca, 2016), it is a minority language. In Equatorial Guinea, it only has an official status from a formal point of view (Correia, 2019, p. 125). Portuguese has been codified and standardized in Portugal and Brazil, with the European variety being adopted as the norm in the other countries. Nevertheless, researchers have shown that local varieties of Portuguese have been emerging, and descriptions of Mozambican Portuguese (Gonçalves P., 2010; Firmino, 2011), Angolan Portuguese (Adriano, 2015; Inverno, 2009) and São Tomean Portuguese (Gonçalves R., 2016; Bouchard, 2017) can be found. In spite of this reality, there are no general monolingual dictionaries describing varieties of Portuguese other than the ones used in Brazil and Portugal (although the compilation of a Dictionary of Mozambican Portuguese is currently underway; see Machungo & Firmino, 2022). It is in this context that the Research Centre for Linguistics Studies at the University of Coimbra (CELGA-ILTEC) is currently developing resources and methodologies for a dictionary of Portuguese as a pluricentric language, which aims to cover all the varieties of Portuguese. According to Atkins & Rundell (2008) and Lew (2015), the initial stage of any new lexicographic project is defining the profile of its users.
Thus, we have created a survey to obtain information about the potential users of this new dictionary so that it can be designed to meet their needs: the Survey on the Use of Dictionaries of Portuguese in Portuguese-speaking Countries – USODIPO https://www.uc.pt/celga-iltec/usodipo/. USODIPO is the first large-scale survey of this type targeted at speakers of Portuguese in all Portuguese-speaking countries. It was administered in eight countries (Equatorial Guinea was not included) between April and July 2023. The methodology adopted is an online, self-administered, anonymous questionnaire. In this paper, we will first provide an overview of the situation of Portuguese in the countries and region where it is an official language. Next, we will present the survey, including its context of origin, objectives, methodology and structure. Finally, we will present some of the main results. We believe the results of the survey will provide a significant contribution to lexicography of Portuguese in general, and to lexicography of pluricentric languages specifically. Moreover, the results might shed light on lexicography carried out in other multilingual countries in which the language described in the dictionaries that are available and traditionally adopted by the population and in schools is not in alignment with the variety of the language as it is used by their nationals on a daily basis.
The objective of this study is to investigate how learners of Italian as a second or foreign language search for new meanings in online Italian dictionaries. Using eye-tracking technology, we carried out experiments inviting users to do exercises on ‘combinations of words’ while they consulted various dictionaries, including De Mauro – Internazionale and Garzanti Linguistica. Results should suggest how to make digital dictionaries more efficient and thus more useful for learners of Italian as a second or foreign language. In lexicology, the concept of ‘combination of words’ is wide and complex, and the terms used vary depending on different traditions. For Italian, for instance, one differentiates between ‘espressioni polirematiche’ (Voghera, 2004; Masini, 2011) and ‘collocazioni ristrette’ (Marello, 1994; 1996; Faloppa, 2010). Definitions of ‘combination of words’ are important because they have implications for lexicography. They should be clear and consistent so that such combinations can, first, be placed in specific slots in dictionaries and, secondly, be quickly found by learners. Eye-tracking technology, and in particular saccades, fixations and regressions, can tell us “what the mind is thinking about [...] and how much cognitive effort is being expended on this” (Conklin et al., 2018, p. 9) and “by recording the dictionary user’s exact gaze position, the technique offers a unique view of the details of dictionary consultation otherwise impossible to observe, thus promising new useful findings which could inform digital dictionary design” (Lew & de Schryver, 2014, p. 348). ‘Combinations of words’ are difficult for learners of Italian as a second or foreign language to identify, because they are usually not labelled, and so it is not easy to distinguish collocations from examples.
In this study, we considered ‘combinations of words’ where at least one of the two (or more) words is listed in the B2 knowledge level of the Italian Lexical Profiles according to the Common European Framework of Reference for Languages (CEFR). We collected some ‘espressioni polirematiche’ from the online dictionary of Italian De Mauro – Internazionale, and some ‘collocazioni’ (under ‘costruzioni’) in the Dizionario delle Collocazioni (Tiberii, 2012). Subsequently, we selected a narrower list of concepts belonging to the semantic field of the five human senses: la vista ‘the sight’, l’udito ‘the sound’, l’olfatto ‘the smell’, il gusto ‘the taste’, il tatto ‘the touch’, and gli occhi ‘the eyes’, le orecchie ‘the ears’, il naso ‘the nose’, la bocca ‘the mouth’, le mani ‘the hands’, etc. For ‘espressioni polirematiche’ we then ended up with collocations like acqua in bocca, largo di mano, a lume di naso. We then presented two groups of participants, 12 learners and 12 native speakers, with some on-screen cloze exercises (cf. Addendum 1) and used eye-tracking technology to see how they went about solving those exercises (cf. Addendum 2). In these exercises, the participants were tasked with completing blanks while consulting online Italian dictionaries. The results show that both groups of participants consulted not only the two dictionaries which we advised them to use (i.e., De Mauro – Internazionale and Garzanti Linguistica), but also other online dictionaries and even Google Translate. We further found that the participants of the two groups consulted dictionaries in the same way, namely, as if they were written pages of paper dictionaries, following the same directions (first vertically, searching for the right example; then horizontally, to be sure the meaning searched for was correct). Moreover, their gaze fixations are always more relevant on the left side of a web page (cf. Addendum 3).
These studies confirm that users don’t know which online tools are more useful in their quest for collocational meaning; that they neither recognize nor distinguish ‘combinations of words’ in dictionaries in general; and that even when they use our recommended dictionaries, they don’t know where to find these groups of words in particular. From a bird’s eye perspective, we are working towards providing answers for how ‘Digital Native’ dictionaries should be conceived and reinterpreted on the basis of (i) studies on actual dictionary usage, (ii) results of our eye-tracking experiments, (iii) lexicological studies in different traditions (i.e., trying to find a common ground in different languages, each with their terminological taxonomies), and (iv) their importance to the field of glottodidactics.
Sign language lexicography, a nascent subfield, remains relatively unexplored, primarily owing to the unique attributes of sign languages (McKee & Vale, 2017). The scarcity of sign language dictionaries is attributed to linguistic, financial, and social challenges (Vacalopoulou, 2020), with limited resources available since the pioneering Dictionary of American Sign Language on Linguistic Principles (Stokoe, Casterline, & Croneberg, 1965). This scarcity is particularly pronounced for Greek Sign Language (GSL) (Vacalopoulou, Efthimiou, & Vasilaki, 2018), leading to accessibility issues for GSL-native children. Addressing the dearth of resources for Greek Sign Language as the primary language (L1) in Greece, this presentation details the development of an online bilingual Modern Greek–Greek Sign Language school dictionary catering to senior elementary deaf children in Greece. To effectively facilitate Greek language education for deaf and hearing-impaired students, the dictionary establishes semantic links between words and signs, offering comprehensive translations and definitions in Greek to mitigate language ambiguity encountered by this student demographic. Developed as an online resource, the dictionary utilizes web technologies, ensuring accessibility for school and university students seeking proficiency in both Greek and Greek Sign Language through a user-friendly interface. The paper unfolds in several sections: firstly, a comprehensive description of sign language, with a focus on GSL, is presented along with a review of existing GSL lexicographical products. Subsequently, the paper advocates for the importance of lexicographic protocols in dictionary creation, referencing established protocols (Gouws & Prinsloo, 2005; Klosa, 2013), and delineates the criteria influencing the choice of the lexicographic protocol for this dictionary.
An overview of the organizational plan is provided, followed by an exploration of the five conceptualization plan phases and a detailed presentation of the dictionary’s macrostructural and microstructural characteristics and general features. The paper meticulously outlines the procedural aspects, lexicographic functions, and the challenges encountered during the dictionary creation process. It is a pedagogical dictionary that aims to strengthen students’ speech and is based on didactic approaches. The layout of the electronic dictionary will be alphabetical as well as thematic, icons will be used instead of lexicographic signs and illustrations will be used to make the environment friendlier. Besides, we must not forget that deaf and hard of hearing students, with natural sign language, are familiar with imagery. Lexicographical information is rendered in color. In terms of its macrostructure, it is a multifunctional pedagogical dictionary suitable for school use (children aged 9–12 years) and emphasis is placed on distinguishing the metaphorical from the literal meaning. Usage examples, synonyms and antonyms are also provided. Compilation of dictionaries must be a transparent process that follows strict lexicographic protocols, so that dictionaries are valid and reliable. Arbitrary dictionary style is often due to a lack of organization and coordination. However, following a lexicographic protocol is not a frequent practice because it is a time-consuming process, and therefore only a small number of researchers do so. This results in the production of dictionaries of low quality or those that do not meet the needs of the user (Tarp, 2009). The dictionary production process leads lexicographers to adhere to defined criteria of design and organization.
A lexicographic protocol is a series of simple and defined evidence-based steps and choices that are recorded and applied consistently during dictionary development and are available to the scientific community to ensure transparency in dictionary compilation. It is essentially a process that determines the criteria of macrostructure, microstructure and mesostructure and all the theoretical issues related to the metalinguistic information of the dictionary. Following are some specific design steps for the dictionary. The lexicographic design of the dictionary followed the phases below: 1. preparation 2. data acquisition 3. computer processing 4. data analysis 5. release preparation. This research addresses pressing issues and augments existing knowledge in inclusive and pedagogical lexicography. The proliferation of resources such as the presented dictionary is pivotal in fostering equality and inclusivity within the Greek education system for the deaf community.
Pashto. Pashto is an Eastern Iranian language spoken in Afghanistan, Pakistan and by a large diaspora community across the globe. It is one of two official languages of Afghanistan and a regional official language in Pakistan’s Khyber Pakhtunkhwa Province. With about 15 million native speakers in Afghanistan and 30 million in Pakistan it is the second-most spoken Iranian language after Persian. Its accessibility to scholarly and lay communities is, however, not commensurate with its linguistic and geopolitical importance. Resources that exist were generally created for anachronistic purposes, do not satisfy the needs of modern user bases, and are insufficiently accessible. Open Pashto-English Dictionary (OPED). Our dictionary project aims to tackle these challenges head-on by creating a sustainable, adaptable, openly accessible, open-source, TEI Lex-0-conformant online dictionary, designed with the wishes and requirements of the myriad user groups in mind. This 2½-year project is funded by the Austrian Academy of Sciences’ go!digital 3.0 programme and is being carried out at the Academy’s Institute for Iranian Studies in cooperation with the University of Vienna’s Department of Finno-Ugrian Studies. As of March 2024, a working demo version of our dictionary can be found at oped.univie.ac.at. Practical Challenges. Pashto is by no means an undescribed language: ample lexicographic resources exist, but the high-quality resources that do exist (a) have not been adequately digitized, if at all, (b) are out of date, (c) have substantial gaps in the information they contain about, e.g., the inflection of nouns and verbs, (d) use Russian, rather than English, as a metalanguage. Given our limited personnel and time resources, the most efficient approach to raising the accessibility of the Pashto lexicon is to process what already exists: digitize, translate, and update existing resources and publish the resulting resource in an open and expandable format. Technological Challenges.
Pashto uses a variation of the Arabic alphabet with additional graphemes not found in the Arabic or Persian alphabets (e.g., <ټ> for a voiceless retroflex plosive /ʈ/). Consequently, all tools we use must be able to handle not only right-to-left text and the correct joining of characters necessary in the Arabic script, but also support the usage of fonts that adequately realize all relevant symbols. This has been surprisingly challenging: even infrastructures designed with “smaller” languages in mind turned out not to support the adequate display of Pashto orthography or the correct alphabetical ordering. This, coupled with desired flexibility regarding the depiction and search functionalities (due to the diverse user base we have in mind), has forced us to rely on “garage solutions” at times that might be seen as reinventing the wheel, but are in fact necessary to create the user interface we promised to create. Philological Challenges. Even the second edition of by far the most extensive and highest-quality lexicographical resource on Pashto, Aslanov (1985), has become quite outdated, not only due to the passage of time since its publication but also due to the political upheaval Pakistan and especially Afghanistan have gone through in recent decades. Pashto-speaking areas on both sides of the border have been subject to profound changes in administrative structures, resulting also in changes in administrative terminologies. Furthermore, Pashto speakers being forced into diaspora and refugee situations has altered the usage domains of the language profoundly. We are addressing lexical gaps and semantic changes within the lexicon by cooperating with native speakers, especially those active in the domain of translation. Through the creation of a morphological model, we also hope to make a corpus-based expansion of our dictionary possible in future. Our User Base.
A matter deserving special attention, especially when creating lexical materials for an under-resourced language, is that one must expect a heterogeneous user group: when few accessible lexical resources exist for a language, prospective users of a dictionary will include native speakers, foreign scholars from different theoretical backgrounds, computational linguists, etc. – i.e., groups of users for whom different resources are developed in the case of structurally strong languages such as English. The user interface we have created for our dictionary aims to be as customizable as possible in order to not just meet the needs of a small, idealized pool of users, but of a diverse user base with radically different desires and requirements.
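The alphabetical-ordering problem mentioned under Technological Challenges can be illustrated with a minimal sketch. It does not reproduce the OPED implementation: the function name is hypothetical, and only the first six letters of the alphabet (alif, be, pe, te, ṭe, se) are included, as an assumption that keeps the example short. It shows that sorting by Unicode code point places the additional graphemes after the basic Arabic letters, whereas a sort key built from an explicit alphabet restores the expected order.

```python
# Partial alphabet for illustration only (assumption): the first six letters.
ALIF, BE, PE = "\u0627", "\u0628", "\u067E"
TE, TTE, SE = "\u062A", "\u067C", "\u062B"
ALPHABET_PREFIX = [ALIF, BE, PE, TE, TTE, SE]

# Map each known letter to its position in the alphabet.
RANK = {letter: index for index, letter in enumerate(ALPHABET_PREFIX)}

def collation_key(word):
    """Sort key: known letters by alphabet position, others by code point."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]

letters = [SE, TTE, TE, PE, BE, ALIF]

by_code_point = sorted(letters)                   # naive default ordering
by_alphabet = sorted(letters, key=collation_key)  # explicit alphabet order

print([f"U+{ord(ch):04X}" for ch in by_code_point])
print([f"U+{ord(ch):04X}" for ch in by_alphabet])
```

A production system would more likely rely on a full collation tailoring than on a hand-written table, but the table makes the source of the misordering easy to see.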
We would like to introduce the results of the ELDI project (Electronic Lexical Database of Indo-Iranian Languages, Pilot module: Persian), launched in August 2020. One of the aims of the project was promoting the use of technologies in teaching languages. A website and a mobile application with the Persian–Czech dictionary were developed as the main planned results of the project. A new web-crawled Persian corpus was also created through cooperation with Comenius University in Bratislava and it was primarily used for the study and validation of lexical data, but it is now also directly linked to both applications and is open for use by students, teachers, and researchers in language studies and teaching. The participation of teachers from Charles University in Prague in the project will help to transfer the project outputs into practical teaching. This article also presents the results of a comparison of a dictionary and AI.
Traditional practices for naming species in the biological sciences often incorporate eponyms. However, the group of honourees is not very diverse, and many individuals have links to colonization. A grassroots movement is emerging within the biological sciences to give new scientific and/or common names to species that bear harmful eponyms. Approaches to renaming species include updating terminology planning processes, using more diverse and inclusive eponyms, re-instating pre-colonial names, and replacing eponyms with transparent terms. Many of these activities are in collaboration with Indigenous communities, as well as with other types of experts and the broader public. It is important for terminologists to be aware of these developments because they are well positioned to contribute to such discussions moving forward.
The COVID-19 pandemic has impacted numerous sectors at different levels and has imposed a radical change in the pace of life in societies across the globe. A partially technical vocabulary related to COVID-19 quickly became part of everyday life, introduced mainly by news and official bodies. To describe the characteristics of the terminology being disseminated in Brazil, the project Study and dissemination of COVID-19 terminology (2020–2024) was proposed. More specifically, it has been possible to observe the recurrence of lexical units closely related to political issues – which is the focus of this paper. Entries like gripezinha (small flu), tratamento precoce (early treatment), kit covid (COVID kit) and ministério paralelo or gabinete paralelo (parallel ministry or parallel office) will be explored in this paper. These examples show cases of politically marked terms related to the Brazilian COVID-19 pandemic context. In Brazil, the pandemic has been marked by science denialism, especially on the part of the federal government. Due to this political influence, particularly concerning interference in the Brazilian public health system, these terms, which are not officially recognized by the Brazilian and international medical community, were included in the proposed dictionary, supplemented with contextual information elucidating the political dimensions entailed.
This paper presents an innovative lexicographic approach embedded within an online resource currently under development: ALMA: Linguistic Multimedia Atlas of Bio/Cultural Food Diversity. ALMA serves a dual purpose: firstly, to showcase linguistic diversity through culinary practices, and secondly, to scrutinise food marketing strategies through the analysis of language and paralanguage on packaging. The resource has two main components: Words for Food (WF) and Words for Choosing Food (WCF) lexicographic articles. The WF articles delve into the lexicon of common language and technicalities of food practices, recognizing the dynamic interaction between common and specialised knowledge of food production. On the other hand, the WCF articles employ a segmentation methodology to scrutinize food packaging, identifying functional and linguistic constituents. By summarizing individual labels and pinpointing issues regarding lack of transparency, the WCF articles facilitate consumer understanding and stimulate debates on labelling standards. This innovative lexicographic approach aims to empower consumers with linguistic and encyclopaedic knowledge while providing insights for legislators and manufacturers to enhance food labelling practice.
Greetings: Dr. Annette Klosa-Kückelhaus (EURALEX President), Dr. Kristina Š. Despot (Chair of the EURALEX 2024 Organising Committee)
Music: Unique Dubrovnik Duo
In Frame Semantics (Fillmore, 1976; 1982; 1985; 2006), all concepts are part of a semantic frame and are related in such a way that the activation of one lexical unit activates the entire frame. A frame is “a script-like conceptual structure that describes a particular type of situation, object, or event along with its participants and props” (Ruppenhofer et al., 2016). In its simplest form, a frame is an underlying conceptual structure into which the meanings of related terms fit. Framing experience involves activating and applying stored knowledge derived from similar contexts and situations. As a cognitive structuring device in the mind, frames are also reflected in language. They can thus be regarded as generalizations over sets of words which describe similar states of affairs, and which could even be expected to share similar sets of roles, and (to some extent) similar syntactic patterns (Baker, 2014). Frame analysis is relevant to many types of real-world scenarios and situations. One of these is romance scamming, an online deception in which a person pretends to have romantic interest in another individual with the intention of manipulating them for financial gain. As widely observed, romance scamming is a variant of the 419 advance fee scam (Levi et al., 2017), which stems from the Spanish Prisoner, a popular confidence game of the 16th century (Gillespie, 2017, pp. 217–218). Modern versions of this same deception are now conducted either partially or totally online. Since the parties involved primarily communicate through text messages, this deception is accomplished through language. The interlocutors in this scenario do not know each other (Olshtain & Treger, 2023, p. 385) and become acquainted solely through chatting. This means that contextual meaning is constructed by the addressee, based on the explicit and implicit information in the messages and the use of the right words.
The decision to trust and believe in the text sender largely depends on this ‘constructed’ context, which may or may not correspond to reality. Using a fabricated identity, the fraudster strives to create the illusion of a romantic relationship between himself and the victim. In a successful deception, it is the fraudster’s lexical choices that cause the victim to activate the Experiencer_focused_emotion frame in FrameNet. However, for this to occur, the Experiencer (fraudster) must convince the victim that he is in love with her and that she is the Content (object of his affection). Once she believes this, he can request money from her. In this affective relationship, which is presumably based on absolute trust, each partner is obliged always to help the other in all situations of financial hardship. In romance scamming, the Experiencer_focused_emotion frame is thus a powerful cognitive structuring device. It is an organized package of knowledge that can easily be retrieved from long-term memory since it is already deeply embedded in Western culture and language. Since it is so deeply entrenched in our minds, the deceiver does not have to construct it. He only needs to use the right words to activate it in the victim and tap into what already exists. Using corpus methodology (Stefanowitsch, 2020), the Lexical Grammar Model (Faber & Mairal, 1999), and Dik’s (1978) stepwise lexical decomposition, we explored the linguistic means used to activate this frame. The corpus (1,045,921 tokens, and 898,377 words) was composed of 75 extended conversations between fraudsters and the author, obtained from November 2020 until April 2024. Also useful were 16 scripts provided by three ex-scammers in Nigeria, who acted as consultants on the condition of anonymity. In this study, the verbs of FEELING in the corpus were analyzed, not only paradigmatically but also syntagmatically, with an emphasis on the semantic classes of their arguments.
Also meaningful was their occurrence in the five-stage process model of romance scamming in Whitty (2013ab). The WordSketch and Concordance modules of SketchEngine (Kilgarriff et al., 2014) were used to extract information regarding predicate-argument structure, the semantic classes of arguments, and collocates. The Thesaurus module helped to confirm the semantic relations between words. These verbs belong to the same frame not only because of their shared core meaning, but also because they take the same number and types of arguments or syntactic dependents. They not only have the same entailments, but also take the same participant’s perspective in the event (Ruppenhofer, 2018). In romance scamming, verbs of FEELING are powerful because they are able to persuade the victim that she is really in a committed relationship with the fraudster. Using verbs such as like, love, adore, cherish, treasure, etc., the fraudsters speak of a ‘forever relationship’ and refer to the victim as their beautiful angel, beloved princess, or even darling wife, whom they will always love, cherish, and treasure for all eternity. These verbs are the core lexicon that the Experiencer uses to weave a beautiful tapestry of lies, which causes the Content to willingly enter a state of suspended disbelief, in which implausible situations are viewed as plausible.
Although the ‘Circular Economy’ has been widely discussed in the media for years, general dictionaries still do not provide the relevant definitions and/or collocations. We show by examining dictionary definitions that many salient words used in this field have undergone varying degrees of semantic broadening in the general language. Current terminological needs often dictate more precise meanings than those used in the general language, in essence leading to semantic narrowing. We discuss these two seemingly opposing forces in relation to a small set of words from the ‘Circular Economy’ and show how this sort of semantic development is easily accounted for within the Communicative Theory of Terminology.
In this article, we return to a classic lexicographical topic and address some aspects involved in the practice of defining. Digital developments have increasingly required dictionary definitions to operate independently of others if they are to be utilised in new contexts, possibly even detached from the original dictionary presentation. We examine two types of definitions where the problem is particularly obvious: morpho-semantically related words and inherited information. The first type refers to words with the same semantic core that appear in regular morphological derivations, in this context illustrated by triplets such as pluralism, pluralist, pluralistic. The second type is to do with senses that presuppose the definition of an earlier, usually superordinate sense, as when a culinary sense is singled out: ‘this fish used as food’. Five dictionaries are compared and analysed to uncover the tendencies and strategies followed. All five dictionaries show internal inconsistency, and while there is general consensus that the traditional paper model should be abandoned, the dictionaries do not necessarily agree on strategy and solutions.
A discussion of technical and editorial considerations in producing a 1800-page hardback dictionary containing 30,000 headwords from an online database of 48,000 headwords. The Concise English-Irish Dictionary (CEID), published in 2020 and the first major English-Irish dictionary published in print form since the 1950s, is a 1800-page hardback dictionary containing 30,000 headwords and 80,000 senses, along with a substantial style and grammar section. Flying in the face of retrodigitisation, this printed dictionary was derived from the New English-Irish Dictionary (NEID), a much larger online dictionary published 2013–2017 and containing 48,000 headwords and 120,000 senses. Simply printing the entirety of the online content would have doubled the size of the printed dictionary, so this necessitated a number of measures to whittle the online content down to a single-volume dictionary. This paper outlines some of the challenges and measures involved, such as selection or deselection of lexicographical content, reformatting for print, and the technical process of outputting the same entry to both screen and paper.
The recent development of the Curriculum for Teaching Greek as a Heritage Language: A Framework for Teachers underscored the need for a dictionary to serve as supplementary material during the curriculum’s implementation at Greek Community schools in the USA. This presentation aims to introduce the Greek Heritage Language Learners’ Illustrated Lexicon (Helix), an online, bilingual, illustrated dictionary designed for Greek Heritage Learners aged 5–10 years living in the USA, Canada, Australia, etc. Helix is intended to enhance the teaching resources available to educators who teach Greek as a heritage language. It provides word definitions based on the linguistic competence of heritage speakers of Greek, facilitating efficient vocabulary acquisition and the development of students’ language skills. Specifically, it aims to serve as the primary dictionary for Greek Heritage students, whether used in the classroom or outside, to practice their reading, writing, speaking, and listening skills.
The paper presents a New Serbian-Russian dictionary and the main principles of its development. We use the most recent explanatory dictionary of Serbian, published by the Serbian Academy of Sciences and Arts in 2018, as a starting point. However, we refine both the word list and the entry structure to meet the requirements of a bilingual edition. We consult text corpora of modern Serbian to check frequencies and define contexts of word usage. The paper describes the criteria for word list formation, discusses the problem of lexical equivalence in closely related languages, and articulates our guidelines for representing polysemy. The dictionary will be published in both traditional print format and as a website. The digital version will be implemented using the OnLex platform; we detail its functionality and highlight the new possibilities it offers to editors and potential users.
Emotion phraseology has experienced an important surge, earning the status of a subgenre within phraseology. More recently, phraseological research has not only focused on how we talk about emotions, but also on how they are conceptualized in the speaker’s mind. A cognitive approach to categorizing phraseology, therefore, seems to provide a framework for establishing criteria that can significantly aid in addressing phraseological units (PUs) about love. Aligned with this perspective, the present study aimed to investigate the conceptual metaphors contributing to the structure of the concept of love in Spanish. To do so, this research focused on a cognitive and phraseological analysis of a corpus of 48 brief texts about love in cinema. Findings evinced the multifaceted nature of love, as shown in the 18 conceptual metaphors identified, which provide insight into the diverse ways individuals experience and articulate complex feelings such as love and affection. This linguistic exploration not only deepens our understanding of love, but also demonstrates the power of metaphorical PUs in capturing the complexities of the human emotional world. We believe that these findings will spark contemplation regarding the pedagogical implications of love PUs and will prompt the formulation of recommendations aimed at enriching metaphorical awareness within the Spanish second language classroom.
The appealing and comprehensive documentation of neologisms and the associated data represents a challenge in lexicography in many respects. Concerning dictionaries of German neologisms, special attention has frequently been paid to the linguistic information of new expressions, while less attention has been paid to discourse-related, i.e., lexical-field-related information. In the first German corpus-based online dictionary (cf. Neologism Dictionary 2006ff.), some of the headwords can be displayed according to broad thematic fields/domains (e.g., media, communication) and sorted by decade, but the overview is limited to a list-like enumeration. No additional information is provided about the individual keywords or the semantic-lexical connection, although this often says a lot about the characteristics of a neologism and its appearance in (linguistic) reality. New trends are responsible for the creation of a bunch of new lexical items at once; as a result, words enter a language as part of a lexical-semantic group. In such a case, the inclusion of semantically neighboring units is particularly relevant for the description of the single expression. My talk deals with the description of such a group and its lexicographical representation. It intends to show that it can offer added value to build a bridge between the consideration of individual entries of headwords and the consideration of clusters of items forming a lexical-semantic field. I will present a proposal for the visualization of related keywords taken from the future online resource for German neologisms (IDS Neo 2020+), which is currently being developed. With the increase in the use of dating platforms, the need for exchange and the lexical inventory required for this also grew: from the mid-2010s, a ‘boom’ in expressions from the world of online dating can be documented in German. 
From Breadcrumbing and Benching to Caspering, Exting, Ghosting, Lovebombing, Haunting, Hoovering, Orbiting, Roaching, Sneating, Stashing and Submarining – the borrowed Anglicisms (cf. Eisenberg, 2020, p. 31) linguistically cover any typical social behavior that is repetitively observable. At the beginning of their word careers, these expressions mostly appear in close proximity to each other. They especially occur in glossary-like lists, where the semantic concepts usually are explained. Semantically, the units of this lexical group differ from each other, sometimes substantially, sometimes only in slight facets and small details, resulting in overlaps and synonymous usages. Looking at factors such as time of vocabulary entry, frequency, distribution and contextual use, the expressions form quite a heterogeneous group. When designing the new online resource IDS Neo 2020+, we set ourselves the goal of creating an interactive and dynamic presentation area in addition to the ‘classic’ dictionary, which combines semasiological and onomasiological approaches (Haß-Zumkehr, 2001; Reichmann, 1990). This concept makes it possible to go beyond the information in the individual word entries and offers space for the collective presentation of subject areas and thematic fields. In some cases, not all lexemes of a subject area can be included as headwords in the dictionary (e.g., due to insufficient evidence), but they interact with the potential keywords to form a lexical-semantic network. This applies to the above-mentioned group of dating terms, as some of them have quite a high distribution and frequency level (e.g., Breadcrumbing, Ghosting, Lovebombing), while others remain contextually close to their lexical peers and are less frequently used (e.g., Caspering, Gatsbying, Submarining). As the question of differentiation arises here, in particular due to the semantic proximity, we consider it to be important to include both cases in a joint presentation. 
Therefore, our Online Resource offers the opportunity to add even less common lexemes with concise information (e.g., paraphrase, vocabulary entry, usage examples), to get a quick overview of the whole semantic group. In addition to the lexicographic information, the group-specific characteristics will be particularly highlighted. This includes semantic intersections / distinguishing features (e.g., Is the action in question only carried out online, or is it linked to a real-life relationship? Does it take place during the relationship, afterward, or with the aim of ending it? Are two people involved or more?), positive/negative connotation, and chronological development. To let the users decide which information they want to see, the displayed content can be arranged dynamically and variably on demand. Each click generates new views, without having to leave the overview by being redirected to another page. These features are common for all presented thematic fields in IDS Neo 2020+.
This study challenges the global notion of ‘pain’ by exploring the semantic nuances of pain-related concepts in Bahasa Indonesia and comparing them to English. It examines four lexemes: sakit, nyeri, pedih, and perih using the Natural Semantic Metalanguage (NSM) and a corpus-based approach. The study proposes alternative definitions to address the circularity and vagueness in the Great Indonesian Dictionary. Analyses reveal distinctions in physical and emotional conceptualizations: physical sensations differ in body locus, intensity, and duration, while emotional pain-like concepts relate to social relationships and vary in triggers, duration, and reactions. This highlights significant cross-linguistic distinctions between Bahasa Indonesia and English in conceptualizing ‘pain’.
In an attempt to provide some background for the overall theme of the Euralex 21 conference, the talk explores the position of LLMs in the field of lexical studies from two angles, a lexicographic and a theoretical linguistic one. From a theoretical perspective, LLMs currently constitute the epitome of distributional semantics, and distributional semantics (for the reasons that I specified in Theories of Lexical Semantics, 2010) is eminently suited as a methodological basis for usage‐based cognitive semantics, allowing for a convergence of major theoretical trends in lexical semantics. But given that LLMs have taken distributional semantics well beyond the shape it had in 2010, does that evaluation still hold? For the lexicographical perspective, I will first draw attention to the too often ignored process through which lexicography not only gave a major descriptive impetus to the development of corpus linguistics, but also specifically contributed to an essential step in the emergence of computational methods for corpus research: LLMs are a tool with at least to some extent lexicographic roots. But again, given that the tool has grown well beyond its original format, how does that affect its relationship to lexicography?
When acquiring vocabulary, L2 learners must understand a word’s form, meaning, and use, which involves more than memorizing one-to-one correlations between languages. Frame Semantics (Fillmore, 1982), an approach to words as evoking a common theme or “frame,” provides language learners and teachers a way to organize vocabulary meaningfully and thus functionally. For lexicographic purposes, Frame Semantics has only been used to structure monolingual lexical databases for a variety of different languages; however, it has not been applied systematically to learners’ dictionaries so far. This talk presents the G-FOL (www.coerll.utexas.edu/frames/), a beginning learner’s dictionary of German for speakers of English based on semantic frames from the English FrameNet database (Fillmore & Baker, 2010) mapped onto vocabulary used in introductory German courses at the University of Texas at Austin. Working with the original semantic frames from Berkeley FrameNet, a group of faculty and graduate students uses corpus data together with native speaker intuitions to compile a freely available frame-based learners’ dictionary for first-year German students (Boas et al., 2016; Gemmell Hudson, 2022). The goal of this effort is to create a learning tool that allows students to systematically learn how to use words related in meaning in context, using the culturally appropriate forms. This part of the presentation focuses on the problem of taking the Berkeley FrameNet frames for English and adopting them in a simplified version for the German learner’s dictionary. We will review how the process takes the frame descriptions and adjusts them to the word lists that beginning learners of German use in the classroom. G-FOL has many affordances for language learners and instructors for vocabulary learning, because information about words on G-FOL is organized via the semantic frame they evoke. 
Instructors can develop units around frames, develop themed vocabulary lists, and provide useful metalinguistic awareness for existing knowledge. This presentation first introduces Frame Semantics and FrameNet, and then presents the workflow, organization, and structure underlying G-FOL. The remainder of the presentation discusses concrete examples of the G-FOL in practice, including lexical entries based on semantic frames and annotated corpus examples, including contrastive example sentences in German and English. We will also discuss the use of authentic corpus examples, images used for illustrative purposes, and additional information regarding multiword expressions and morphology. Finally, we show how semantic frames can be used not only to structure a learner’s dictionary of German, but also to make grammar more accessible to students, by relating grammatical constructions to the same semantic frames in a way that highlights their purpose and function in the target language. We will take German’s dative case and ditransitive construction as our examples, linking them to the Giving frame and the various German verbs, nouns, and adjectives that evoke it. In addition, we present a number of teaching and learning strategies that put the lexical and grammatical information provided by G-FOL at the centre of the students’ learning.
Introduction
The research approach to semantic development in first language acquisition (FLA) remains predominantly enclosed in traditional lexicographic terms and notions (usually simplified). This viewpoint doesn’t adequately span the lexical system’s complexity and fails to present the mechanisms and processes involved in its development. Since FrameNet possesses standardized methods and procedures for the sufficient description of the lexical system, we suggest it as a research method in studies of semantic development.
Research subject and goal
We explore FrameNet as a tool for detecting and tracking changes occurring during semantic development in FLA. The subject of the analysis is a Serbian LU “uspomena” and its usage in essays of elementary school students aged 9 to 15. This unit suits LU “memory.n”, frame: Remembering_experience in FrameNet. According to the Dictionary of the Serbian Language (Nikolić, 2011), the lexeme “uspomena” consists of two lexical units: 1. ‘a memory of someone or something, an impression of remembering someone or something’; 2. ‘an item that reminds of someone or something, which is kept in memory.’ Using FrameNet, we also explore whether these LUs are split in the linguistic production of elementary school students in Serbia.
Language resources
Instances of target word usage were excerpted from four different corpora: (1) Developmental Corpus of Serbian Written Language (RAKOPS), (2) Referent Corpus of Contemporary Serbian (SrpKor2013 and SrpKor2021), (3) Serbian Web Corpus (SrWac), and (4) SrpELTeC (Serbian part of the European Literary Text Collection).
Method
Examples of target word usage extracted from corpora have been manually annotated for FEs and PTs by applying methods and procedures described in Ruppenhofer et al. (2016). The first step involved establishing FrameNet for the Serbian language and building its support (first attempts are described in Marković et al., 2021). Next, FEs and PTs were established according to the target word usage in both referent corpora, the web corpus, and ELTeC. This way, the reference point for comparing the usage of the analyzed LU in school-age children’s written production was provided. Afterward, extracted LU instances from the RAKOPS were annotated for FEs and PTs. Lastly, established FEs and PTs were compared within RAKOPS, according to students’ grade level, and between RAKOPS and referent corpora.
Some results
Differences in FEs and PTs of the analyzed LU between the usage in the RAKOPS and referent corpora were determined. A core element STATE isn’t confirmed in examples excerpted from the RAKOPS, while three FEs from another LU (“recollection.n”) are found (grey zone in Table 1). This shows that the demarcation line between the two senses of the lemma “uspomena” isn’t fully established, and that sense blending in part of peripheral and extra-thematic FEs occurs at this developmental stage. Also, differences in FEs and PTs according to students’ grade level were determined (Table 1).
Table 1: FEs and PTs of non-standard usage examples in RAKOPS according to grade level
LU “uspomena.n” (Eng. memory.n); FE and PT columns: Cognizer, Exp, Imp, Salient_ent, Context, Manner, Content, Place, Explan
Grade 3: CNI Kao-što.Sub AJP PP[u].Loc Kada.Sub Jer.Sub Np.Nom Srel Poss.Det PP[na].Dep AVP
Grade 4: Poss.Dat PP[u].Loc CNI AJP PP[u].Loc Poss.Dat Srel PP[u].Loc Poss.Nom PP[na].Loc Np.Nom AVP Poss.Dat Poss.Nom PP[u].Loc
Grade 5: Poss.Nom AVP CNI PP[u].Loc CNI AJP PP[u].Loc Poss.Det AVP Poss.Det AJP Srel PP[u].Loc
Grade 6: Poss.Dat PP[iz].Gen AVP Poss.Dat PP[u].Loc CNI AJP AVP CNI Kao-npr. NP AJP AVP CNI AJP CNI AVP / PP[u].Loc Poss.Dat AVP
Grade 7: CNI AVP Poss.Dat AVP / NP.nom PP[u].Loc CNI PP[od].Gen AJP
Grade 8: Poss.Dat AVP CNI PP[sa].Ins AJP CNI AVP CNI PP[iz].Gen
Conclusion
In the traditional approach, determined differences in the usage of the analyzed LU would be considered “lexical errors.” Nonetheless, the application of FrameNet enabled insightful findings that reveal the course of meaning development and sense splitting rather than a vacuous deviation from the standard usage.
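The comparison step in the Method section (FEs attested per grade level in RAKOPS versus FEs attested in the reference corpora) can be pictured as a set comparison over annotated instances. The following Python sketch is only an illustration under stated assumptions and is not the authors' procedure: the record layout, the function names `fe_inventory` and `compare_to_reference`, and all annotation values are hypothetical toy data, not results from the study.

```python
from collections import defaultdict

# Hypothetical annotated instances: (source, grade or None, frame element, phrase type).
# Toy values for illustration only; they are not taken from the study.
ANNOTATIONS = [
    ("reference", None, "Cognizer", "Poss.Dat"),
    ("reference", None, "State", "AJP"),
    ("reference", None, "Content", "PP[na].Loc"),
    ("RAKOPS", 3, "Cognizer", "CNI"),
    ("RAKOPS", 3, "Content", "PP[na].Loc"),
    ("RAKOPS", 4, "Cognizer", "Poss.Dat"),
    ("RAKOPS", 4, "Place", "PP[u].Loc"),
]


def fe_inventory(records, source):
    """Map each grade (None for ungraded corpora) to the set of FEs attested in `source`."""
    inventory = defaultdict(set)
    for src, grade, fe, _pt in records:
        if src == source:
            inventory[grade].add(fe)
    return dict(inventory)


def compare_to_reference(records):
    """For each RAKOPS grade, list FEs missing from and additional to the reference set."""
    reference = fe_inventory(records, "reference").get(None, set())
    report = {}
    for grade, fes in sorted(fe_inventory(records, "RAKOPS").items()):
        report[grade] = {
            "missing": sorted(reference - fes),  # attested in reference, not in this grade
            "extra": sorted(fes - reference),    # attested in this grade, not in reference
        }
    return report


if __name__ == "__main__":
    for grade, diff in compare_to_reference(ANNOTATIONS).items():
        print(grade, diff)
```

Plain sets keep the sketch close to the description in the abstract (presence or absence of an FE); a frequency-based comparison would need counts instead of sets.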
In this paper, we report on our development of a multi-level analysis framework that allows us to assess AI-generated lexicographic texts on both a quantitative and qualitative level and compare them with human-written texts. We approach this problem through a systematic and fine-grained evaluation, using dictionary articles created by human subjects with the help of ChatGPT as an example. The levels of our framework concern the assessment of individual entries, a comparison with existing dictionary entries written by experts, an analysis of the writing experiment, and the discussion of AI-specific aspects. For the first level, we propose an elaborate evaluation grid that enables a fine-grained comparison of dictionary entries. While this grid has been developed for a specific writing experiment, it can be adapted by metalexicographical experts for the evaluation of all kinds of dictionary entries and all kinds of dictionary information categories.
In current public discourse and debates, new expressions (or new meanings) are constantly emerging. One of the results during this process is the creation of new synonymous relations between two new words or between neologisms and already established lexemes. Sometimes, such as during the COVID pandemic, there were also veritable “synonym pushes” (Harm, 2022) in German (e.g., Covid(-19) / Corona / Sars-CoV-2, Booster(impfung) / Auffrischungsimpfung or Lockdown / Shutdown) (cf. Storjohann & Pawels, 2023). In the course of lexicalization and competition between neologistic (near-)synonyms as well as between new terms and well-established lexemes, various scenarios can occur: These include processes leading to the decline in use or lexical loss in which, for example, a lexeme disappears again or only plays an extremely marginal role. Another scenario would be “peaceful coexistence” where lexical rivalry does not seem to play a major role and the (near-)synonymous terms remain equal members of the vocabulary. Linguistic rivalry can also lead to semantic shifts, e.g., semantic differentiation, sense broadening or narrowing etc. In this context, Aronoff and Lindsay also speak of (semantic) niches (cf. Lindsay & Aronoff, 2013; Aronoff, 2016); a niche for a word is “a clearly defined subdomain within its potential domain” (Lindsay & Aronoff, 2013, p. 135). However, according to Dalmas et al. (2015), (near-)synonyms can also be distinguished by parameters other than semantic aspects of the lexemes in question. For example, they talk about the relevance of the thematic domain and discourse practice, which play a decisive role in the choice of one or the other lexeme. These parameters should therefore also be considered in an analysis of newly emerging meaning equivalents. Another factor that has potential effects on lexical competition between neologistic (near-)synonyms is linguistic/lexical doubt if attested regularly (cf. Klein, 2003, 2009 and 2018). 
According to Klein (2003, p. 7, emphasis in original), a case of linguistic doubt can be defined as follows: “Ein sprachlicher Zweifelsfall (Zf) ist eine sprachliche Einheit (Wort/Wortform/Satz), bei der kompetente Sprecher (a.) im Blick auf (mindestens) zwei Varianten (a, b…) in Zweifel geraten (b.) können, welche der beiden Formen (standardsprachlich) (c.) korrekt ist […]” (“A case of linguistic doubt is a linguistic unit (word/word form/sentence) where competent speakers (a.) may be in doubt (b.) with regard to (at least) two variants (a, b...) as to which of the two forms is correct (in terms of standard language) (c.)”; own translation).
Indications of cases of linguistic doubt can be found, for example, in online forums or in meta-linguistic reflections in DeReKo such as the following:
(1) „Statt ‚boostern‘ sollte es ‚boosten‘ heißen, findet Leser Sch. In der Tat sagt man ja auch ‚fighten‘ und nicht ‚fightern‘, obwohl es im Englischen den dem booster vergleichbaren fighter gäbe.“ (Süddeutsche Zeitung, 31.12.2021, p. 14, Sprachlabor) (“Instead of ‘boostern’ it should be ‘boosten’, reader Sch. believes. Indeed, one also says ‘fighten’ and not ‘fightern’, although English does have fighter, which is comparable to booster.”) This citation shows that words like boostern, a variant of boosten, which do not correspond to any known pattern, can also cause linguistic uncertainty. After theoretical remarks about neologisms, synonyms and linguistic doubt, in this paper, I will show a few examples of a corpus-linguistic analysis of German (near-)synonyms (some of which cause doubt) using DeReKo and its analysis tool COSMAS II. These pairs/groups include mainly neologisms but also some words which are already established in German. The focus lies on a microdiachronic analysis (of the changes) of co-occurrences, frequency and diffusion, covering an observation period from 2010 up to and including 2023. The (near-)synonyms are taken from the IDS neologism dictionary (Neologismenwörterbuch, 2006ff.). The corpus analysis is supplemented with illustrative frequency graphs generated by OWIDplusLIVE, a tool that provides insights into the statistical development of terms and is therefore well suited for the direct comparison of two or more lexemes. If required, the tool allows an analysis updated daily (reference time: previous day) on the basis of various RSS news feeds from German online press since the beginning of 2020 (cf. Michaelis, 2023, p. 186ff.). Building on the linguistic-theoretical analysis of neologistic (near-)synonyms, I will turn to the question of how to incorporate the findings into an online dictionary of neologistic (near-)synonyms that is currently being created and that is able to answer questions such as those raised in online forums. 
Among other things, the focus will be on what would help dictionary users who are currently faced with a case of new lexical doubt in order to satisfy their consultation needs and to be able to decide on the more appropriate variant in a particular communicative situation.
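The microdiachronic frequency comparison of two competing variants described above can be illustrated with a small normalisation step: raw hits per year are converted to occurrences per million tokens so that years with differently sized corpus slices remain comparable. The Python sketch below is a minimal illustration under stated assumptions; it does not reproduce COSMAS II or OWIDplusLIVE, and the function names, the labels `variant_a` and `variant_b`, and all counts are hypothetical toy values rather than DeReKo data.

```python
def per_million(hits, corpus_size):
    """Relative frequency: occurrences per million tokens."""
    return hits * 1_000_000 / corpus_size


def yearly_profile(hits_by_year, size_by_year):
    """Relative frequency per year; years without hits count as zero."""
    return {
        year: per_million(hits_by_year.get(year, 0), size)
        for year, size in sorted(size_by_year.items())
    }


def leading_variant(profile_a, profile_b, label_a, label_b):
    """Per year, name the variant with the higher relative frequency (or 'tie')."""
    leaders = {}
    for year in sorted(profile_a):
        if profile_a[year] > profile_b[year]:
            leaders[year] = label_a
        elif profile_a[year] < profile_b[year]:
            leaders[year] = label_b
        else:
            leaders[year] = "tie"
    return leaders


# Toy numbers for illustration only (hypothetical corpus slices and hit counts).
SIZES = {2021: 2_000_000, 2022: 4_000_000}
HITS_A = {2021: 10, 2022: 8}
HITS_B = {2021: 4, 2022: 12}

if __name__ == "__main__":
    a = yearly_profile(HITS_A, SIZES)
    b = yearly_profile(HITS_B, SIZES)
    print(a, b, leading_variant(a, b, "variant_a", "variant_b"))
```

Normalising before comparing matters because the raw count of a variant can fall while its share of the (larger or smaller) yearly slice moves in the other direction.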
Some (prescriptive) dictionaries do not include recently borrowed lexemes, while other, descriptive ones treat them like older words or (‘native’) neologisms formed within the given language. The question of inclusion/exclusion is especially relevant in cases where a ‘native’ neologism in a language and a newly borrowed word are in fact (near-)synonyms; compare, for example, German downloaden – herunterladen (1990s), Gendergap – Geschlechtergefälle (2000s), Wallbox – Ladestation (2010s), and Microgreens – Mikrogrün (2020s). In our paper, we present a study on the preference of such neologistic synonym pairs and build on previous studies on the acceptance of neologisms. Here, we hypothesized that borrowed neologisms are accepted similarly well by the speech community as other ‘native’ neologisms, e.g., new lexical units created by word formation or loan translations. Both types of neologisms are not characterized by a specific semantic relationship between them. A first corpus-based study (Klosa-Kückelhaus & Wolfer, 2020) focused on frequency developments of neologisms and on the use of pragmatic markers (e.g., quotation marks) with these words as indicators for acceptance in DeReKo, a German reference corpus. These results were contrasted with those from a psycholinguistic experimental paradigm (Wolfer & Klosa-Kückelhaus, 2023) that allowed us to estimate the degree of uncertainty of the participants based on the mouse trajectories of participants’ responses. While the first study indicated a clear difference between borrowed and ‘native’ neologisms (highly frequent neologisms in general, as well as German neologisms, are marked less often) with diminishing differences over time, the results of the second study suggested (unexpectedly) that Anglo-neologisms are accepted more frequently, more quickly, and with less uncertainty than the ‘native’ ones. These effects, however, were restricted to participants born after 1980. 
The present study expands on these findings by collecting data in a questionnaire study employing a two-alternative forced-choice paradigm accompanied by supplementary contrastive corpus examinations. This time, the subjects of investigation were different neologisms holding a synonymous relation. Overall, we examine forty pairs of (near-)synonyms from four decades (1990s to 2020s) and a range of subject areas (e.g., technology, economics). The pairs are embedded in a stimulus sentence providing some context for their use. This stimulus sentence is preceded by a context sentence which further clarifies the communicative situation. This contextual scenario itself varies between formal and informal. Hence, for each choice between a ‘native’ neologism and its sense equivalent (a borrowed word), we can investigate the influence of the decade when a word first appeared in the German language (as documented by the German Neologismenwörterbuch (2006ff.)) and the communicative setting (formal vs. informal). Several covariates that might modulate these effects are available via the personal data provided by the participants. The questionnaire results will be juxtaposed with intrinsic characteristics of the words, such as corpus frequency or subject area. To discern usage preferences and potential distinctions, Cruse (2004) suggests examining the selectional and collocational profiles. This approach, grounded in usage, highlights syntagmatic relations in particular communicative situations. A subsequent corpus-based analysis will try to determine ‘contrastive’ contexts, i.e., contexts that tend to favor one lexeme over the other. On the other hand, we also aim to identify (types of) contexts in which both lexemes co-occur (see example (1)). 1) Als private Strom-Tankstelle empfiehlt sich stattdessen die Installation einer sogenannten Wallbox, die an der Garagenwand festmontiert wird. Die privaten Ladestationen gibt es in verschiedenen Leistungsstufen. 
(DeReKo, Die Rheinpfalz, 01.03.2019) (Instead, as a private power station, we recommend installing a so-called wallbox, which is permanently mounted on the garage wall. The private charging stations are available with different power levels.) All analyses will be based on DeReKo. It is argued that corpus-based examinations of the specified synonyms enhance the insights obtained from the questionnaire. Both data collection and interpretation, as well as the corpus analyses, are currently still under way. It is, unfortunately, too soon to present any conclusive results here. However, we expect the results to be available and analyses to be completed well before the conference. Given our previous studies, we expect borrowed neologisms to be preferred among younger participants proficient in English, presumably even more so in informal contexts – this should also align with frequency of occurrence in reference corpora. We also posit that discernible tendencies emerge over time in terms of established preferences regarding the selection of native versus non-native terms. The findings not only offer understanding of lexicological matters related to the equivalence of meaning in newly coined and/or borrowed words but also address diverse lexicographic inquiries (cf. Hahn, 2004; Storjohann & Pawels, 2023). Consequently, we will propose alternatives for comparative lexicographic documentations, adopting an approach where entries resemble a “discriminating synonymy based on contextual analysis” (Hausmann, 1990, p. 1071).
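The questionnaire design described above crosses the decade in which a pair entered German with the communicative setting (formal vs. informal), and each response is a forced choice between the ‘native’ and the borrowed member of a pair. A first descriptive look at such data could tabulate the share of borrowed choices per design cell. The Python sketch below is a hedged illustration only: the field names, the function `borrowed_share`, and the responses are hypothetical toy data, and it does not reflect the authors' statistical analysis, which the abstract reports as still under way.

```python
from collections import Counter

# Hypothetical forced-choice responses (toy data, not study results).
RESPONSES = [
    {"decade": "1990s", "setting": "formal", "choice": "native"},
    {"decade": "1990s", "setting": "informal", "choice": "borrowed"},
    {"decade": "2010s", "setting": "formal", "choice": "borrowed"},
    {"decade": "2010s", "setting": "formal", "choice": "native"},
    {"decade": "2010s", "setting": "informal", "choice": "borrowed"},
    {"decade": "2010s", "setting": "informal", "choice": "borrowed"},
]


def borrowed_share(responses):
    """Share of 'borrowed' choices for each (decade, setting) cell of the design."""
    totals = Counter()
    borrowed = Counter()
    for response in responses:
        cell = (response["decade"], response["setting"])
        totals[cell] += 1
        if response["choice"] == "borrowed":
            borrowed[cell] += 1
    return {cell: borrowed[cell] / totals[cell] for cell in totals}


if __name__ == "__main__":
    for cell, share in sorted(borrowed_share(RESPONSES).items()):
        print(cell, share)
```

Participant-level covariates (for example year of birth) would require a model with more structure than this cell-wise tabulation.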
The European Network on Lexical Innovation (ENEOLI, CA22126 – www.cost.eu/actions/CA22126/, October 2023 – October 2027) is a COST Action seeking to address the lack of comprehensive, multilingual, and globally focused research on neology. As of July 2024, 252 members from 48 different countries have been participating in the Action. The main goal is to establish a network of researchers across Europe and internationally to share best practices and methodologies for studying and documenting lexical innovation. This effort centres on neology, an essential but frequently overlooked aspect of natural language analysis. Linguistically, examining neology improves the comprehension of a language’s lexical system and its development. Furthermore, from a broader perspective, documenting neologisms sheds light on the material and social aspects of language communities. ENEOLI tackles several key challenges: (1) Defining and elucidating fundamental concepts in lexical innovation by developing an open-access multilingual glossary; (2) Reviewing and disseminating methodological implementations, digital resources, and tools for identifying and tracking neologisms over time using both natural language processing and sociolinguistic/psycholinguistic methods; (3) Conducting comparative studies of neologisms across European languages, examining factors such as technological impacts on lexical innovation and contact-induced language change in the digital age; and (4) Offering specialised neology training for professionals like translators, teachers, technical communicators, and terminologists. 
ENEOLI aims to advance neology through a comprehensive approach: (1) Creating a digital open-access multilingual glossary to define and illustrate the terminology related to lexical innovation; (2) Developing and refining methods, digital resources, and tools for neology research; (3) Performing diachronic and synchronic comparative studies of neology; and (4) Innovating neology training programs. To accomplish these objectives, the Action is structured into four work packages: WG1: Multilingual glossary of neology; WG2: Methods, digital resources, and tools for neology; WG3: Diachronic and synchronic comparative studies of neology; and WG4: Training in Neology. The first work package, WG1, focuses on creating a comprehensive multilingual glossary that defines and elucidates fundamental concepts in lexical innovation. This open-access resource, groundbreaking in many aspects, will serve as a crucial reference for researchers and professionals in various fields, offering clear definitions and examples of neologisms. The glossary is designed to be dynamic, continuously updated to reflect ongoing linguistic and technological developments, and to promote global intercultural understanding and cooperation. WG1 is currently engaged in a multifaceted project aimed at creating the born-digital multilingual glossary of neology terms. The initial phase involves the compilation and analysis of a French corpus (Lambert-Lucas corpus, 2024) provided by Lambert-Lucas Editions, utilizing monographs from the collection La Lexicothèque. Tools such as AntConc (https://www.laurenceanthony.net/software/antconc/) and Sketch Engine (https://www.sketchengine.eu/) have been instrumental in extracting term candidates from this corpus. The methodology developed from this pilot study has informed subsequent tasks. Initially, keywords, monolexical and polylexical terms, and N-grams underwent a rigorous validation process involving a panel of experts.
This collaborative review has already resulted in an initial list of over 100 terms, now ready to be included in a multilingual neology vocabulary called NeoVoc. Concurrently, efforts are being expanded to enrich NeoCorpus (https://www.zotero.org/groups/5449136/neocorpus), a collection of books and articles on neology in various languages, managed within a Zotero group. This corpus will enhance the metadata and allow the creation of a vocabulary on neology, NeoVoc, in Wikibase. This will facilitate the inclusion of English equivalents and terms from other languages, paving the way for a comprehensive glossary that meets the needs of scholars, lexicographers, students and neology enthusiasts globally. WG2 is dedicated to developing and refining methods, digital resources, and tools for neology research. This includes leveraging natural language processing (NLP) technologies to identify and track neologisms, as well as employing sociolinguistic and psycholinguistic methods to understand their usage and dissemination. By reviewing and disseminating the best practices and methodologies, WG2 ensures that researchers can access the most effective tools for studying lexical innovation. Additionally, WG2 also looks at how these tools and methods inform the lexicographic treatment of lexical innovation, with a focus on the editorial decisions around the inclusion and description of neologisms in dictionaries and other lexical resources. WG3 conducts diachronic and synchronic comparative studies of neology across various European languages. These studies will explore how social, cultural, political and technological evolutions impact lexical innovation and how language contact in the digital age induces changes in vocabulary.
Comparative research will provide valuable insights into the similarities and differences in lexical innovation among different languages and cultures, contributing to a deeper understanding of neology. WG4 focuses on providing specialised training in neology for professionals such as translators, teachers, technical communicators, and terminologists. This training is essential to ensure that these professionals can effectively apply knowledge of neologisms in their respective fields, promoting better understanding and use of new terms. Additionally, the training programs aim to integrate research findings into professional practices, enhancing the overall impact and relevance of neology research. To support the efforts of the four working groups and to transfer know-how to younger researchers, we will launch an annual training school, with the first session in 2025, focusing on neology and lexical innovation. The primary aim is to deepen knowledge in these areas and their related theoretical and applied fields. This talk presents an overview of the ENEOLI project, detailing the goals of each work package, ongoing activities, and emphasizing its lexicographic components. The project represents a significant collaborative effort to advance the study and practice of lexical innovation, promoting a deeper and more inclusive understanding of linguistic evolution in a multilingual and global context. Through its comprehensive and innovative approach, ENEOLI aims not only to fill gaps in current knowledge but also to create lasting and valuable resources for researchers, professionals, and language communities worldwide. To achieve these aims, we also need to train individuals to improve the quality of neology and lexical innovation projects, thereby strengthening the global community.
In 2023, the Institute of the Estonian Language, in collaboration with the Center for Applied Anthropology of Estonia, conducted a user experience survey aimed at understanding the habits, needs, and attitudes of users of the language portal Sõnaveeb (‘Word Web’) and preparing for the publication of the Dictionary of Standard Estonian (DSE) in 2025. This paper addresses prescriptive and descriptive issues in Estonian lexicography, including controversial meanings. It provides an overview of the user experience survey, detailing the methodology of the online survey and the subsequent qualitative analysis of responses. The findings from the survey are presented and discussed, revealing diverse attitudes toward language among users. While some dictionary users only seek information about language, others aim to enrich their language use, and a third group seeks correctness and guidelines from language planning, including ‘correct’ meanings of words. The contemporary linguistic approach is descriptive rather than normative and this has been adopted in the EKI Combined Dictionary (available via the language portal Sõnaveeb). However, the legal norm of Standard Estonian is still determined (among other sources) by the latest Dictionary of Standard Estonian (DSE). The existence of these two separate sources has been causing confusion among dictionary users.
It may seem obvious to state that tracing the history of a language involves consulting lexicographical works of all kinds, but the truth is that specialized lexicographical compilations, i.e., those referring to the specialized languages of a particular field of knowledge, have not always been duly considered in the diachronic study of language. In this contribution, we aim to present a computer-based open-access research tool designed to assist researchers in tracing the history of the medical lexicon and, therefore, the evolution of medicine within the Spanish-speaking context. This tool is known as Tesoro Lexicográfico Médico (Medical Lexicographical Thesaurus; see Gutiérrez-Rodilla, 2024) and stands as the first of its kind created for a scientific, specialized field in Spanish. In our presentation, we will outline the phases into which the project was divided and detail some of the results we have achieved to date. In the initial phase, it was essential to first locate and then familiarize ourselves with the dictionaries generated and published in Spain within the medical domain during a specific timeframe. The group of nine researchers leading this initiative has been dedicated for years to delving into the historical and lexicographical background of the medical domain in the Spanish language, placing us in a privileged position to successfully complete the project. The chosen period is linked to a historical juncture that marked a pivotal development in the evolution of specialized lexicography across the European continent, and consequently, in Spain – specifically, the 18th and 19th centuries, along with the early years of the 20th century. Indeed, as is widely acknowledged, this constitutes the significant Achilles heel in the history of metalexicography: our lack of knowledge of how many and which specialized dictionaries were produced in the past.
This is certainly regrettable, considering the wealth of valuable information that such dictionaries invariably yield. As McConchie (2014, n. p.) puts it: “dictionaries themselves and those who compiled them remain largely in the outer darkness. (…) [T]he whole area remains a goldmine of rich research pickings.” This gap is gradually being filled in some areas. In the case of medicine, our research group has opened in the last years a line of research that aimed to account for all the existing repertoires of lexicographical interest in the medical field in Spanish and then proceeded to study them (Gutiérrez-Rodilla, 1999; 2017, for instance). During this phase of searching, locating, and compiling medical dictionaries, we also carried out a systematic classification of the identified works based on the following parameters. On the one hand, lexicographic specialized works can be categorized into terminological and encyclopaedic dictionaries. The former have been referred to as ‘word dictionaries’, ‘lexicons’, and ‘vocabularies’. This category of lexicographic works encountered limited success in France but thrived in Spain, addressing a clear need to name the emerging concepts and theories developed north of the Pyrenees. Another criterion for classifying lexicographic works from this period is the specialized subject matter they encompassed. Thus, medical reference works can be divided into general dictionaries of medicine and dictionaries focused on a highly specific medical field, such as Therapeutics or Symptomatology. Another consideration is the intended audience: some dictionaries were tailored for specialists, while others targeted the general public. A final criterion is whether the dictionary was originally composed in Spanish or translated and/or adapted from another language. 
In a second phase, it became imperative to design the computer-based tool to collect the selected Spanish medical dictionaries, taking into consideration the requirements expressed by researchers from various fields of knowledge, including the history of medicine, the history of specialized languages, the diachronic study of medical Spanish or specialized translation, among many others. In our presentation, we will provide an overview of the user interface and elucidate how the tool has been conceived. Lastly, the current phase, the third one in which we are presently immersed, focuses on enriching the tool with the diverse aforementioned medical dictionaries. As of now, six terminological medical dictionaries originally written in Spanish between 1730 and 1886 have been incorporated for consultation, including more than 60,000 lemmas (such as Suárez de Ribera, 1730; Hurtado de Mendoza, 1840; or Vázquez de Quevedo, 1852). In the last part of the presentation, we will highlight some of the benefits that this tool offers to researchers interested in the study of medical language or specialized language in Spanish. All in all, we will provide examples from recent 94 publications by members of the group and discuss potential future applications of this valuable tool.
The paper presents a project devised by Georgian and Hungarian lexicographers which aims at improving dictionary use skills and dictionary culture in Georgia and Hungary. The project is based on previous experience, studies and findings of its authors at Ilia State University (Georgia) and Károli Gáspár University of the Reformed Church in Hungary. The feedback gathered from theoretical lexicography courses and the needs of the students emerging from these courses revealed the necessity to concentrate more on practical issues of teaching dictionary skills. The cross-border cooperation project will be divided into two stages. At the initial stage, the shortcomings in dictionary skills among students, as well as the special needs of students and teachers will be identified with the help of a questionnaire, supplemented by interviews and tests to refine the data. The results of the survey will be used in the next stage for the development of teaching materials, which will include a workbook (in print and e-book format), a teacher’s book (in print and e-book format), as well as a variety of online tools and exercises for language learners which will help them explore reference skills from many different angles and in different situations.
About DANTE
DANTE (Database of Analysed Texts of English) was initially developed in the years 2008–2010 (Atkins, Kilgarriff & Rundell, 2010) by a lexicographic team led by Sue Atkins, Adam Kilgarriff, Valerie Grundy and Michael Rundell. It was commissioned by Foras na Gaeilge, a governmental agency promoting the use of the Irish language, for the development of the New English Irish Dictionary (Mianáin & Convery, 2014). It was produced on the basis of an English corpus of about 1.7 billion words using the Sketch Engine toolchain (Kilgarriff et al., 2014). DANTE provides a very detailed lexicographic analysis of about 50,000 single-word English entries as well as 45,000 compounds, with lexical units subject to the following structure:
- wordclass
- secondary grammar (inherent properties of headword)
- informal definitions
- syntactic constructions and arguments of the headword
- lexical collocates based on corpus frequencies
- support verb constructions
- support prepositions
- domain/subject field
- regional variety
- speaker/writer attitude
- time
- register
- style
- full example sentences (from corpus)
- variant forms
- derived forms
- cross-reference.
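To make the structure listed above more concrete, the following is a minimal sketch of how a single lexical unit with these information types could be modelled in Python. The class name, field names and sample values are hypothetical and chosen only for illustration; they are not DANTE's actual schema or data.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class LexicalUnit:
    """One lexical unit of an entry; all field names are illustrative assumptions."""
    headword: str
    wordclass: str
    definition: str                                              # informal definition
    secondary_grammar: List[str] = field(default_factory=list)   # inherent properties
    constructions: List[str] = field(default_factory=list)       # syntactic constructions
    collocates: List[str] = field(default_factory=list)          # corpus-based collocates
    support_verbs: List[str] = field(default_factory=list)
    support_prepositions: List[str] = field(default_factory=list)
    labels: Dict[str, str] = field(default_factory=dict)         # domain, region, attitude, time, register, style
    examples: List[str] = field(default_factory=list)            # full corpus sentences
    variant_forms: List[str] = field(default_factory=list)
    derived_forms: List[str] = field(default_factory=list)
    cross_references: List[str] = field(default_factory=list)


def empty_fields(unit: LexicalUnit) -> List[str]:
    """Return the names of all fields that carry no information for this unit."""
    return [f.name for f in fields(unit) if not getattr(unit, f.name)]


# Invented sample unit, not taken from DANTE.
unit = LexicalUnit(
    headword="charge",
    wordclass="verb",
    definition="to put electricity into a battery",
    labels={"register": "neutral"},
    examples=["You can charge the car overnight."],
)
print(empty_fields(unit))
```

A helper such as `empty_fields` shows one way a structured record of this kind could be checked for completeness, for instance when comparing automatically drafted entries against manually compiled ones.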
DANTE Resurrected
DANTE was originally released on the IDM DPS platform and until 2023 it was a closed proprietary product of Foras na Gaeilge. In this paper we present new developments following the decision made by Foras na Gaeilge to release the content of DANTE under the terms of the CC-BY 4.0 open source license. We show a new dedicated web interface for DANTE based on the Lexonomy dictionary platform (Měchura et al., 2017; Rambousek et al., 2021) hosted at anonymized which interlinks DANTE with additional corpus-based resources, and we discuss future uses of DANTE for research in lexicography. We particularly focus on using DANTE for the evaluation of automatic dictionary drafting, including drafting with large language models, and the full paper will provide an experimental evaluation of these methods based on DANTE data.
Conclusions
The 2010 paper on DANTE ends with the following description: “DANTE is a lexicographic project where the end-product is not a dictionary but an in-depth analysis to be used for creating one or more dictionaries. The users of DANTE are not the dictionary-using public but the lexicographic teams who will take this on to dictionary status.” We believe that by open sourcing DANTE, Foras na Gaeilge made an important step towards the goals as initially envisaged and that DANTE is an important and welcome contribution to the international lexicographic community.
Introduction
In 1911, the Berlin missionary Karl Heinrich Julius Endemann published his dictionary of the Sotho language, Wörterbuch der Sotho Sprache. This dictionary faced scholarly neglect due to its rare combination of source and target languages, i.e., Sotho and German respectively, and also its missionary focus. Obsolete orthography, high user skill demands, and a lack of alignment with modern lexicographic principles contributed to its marginalization. This paper re-evaluates the dictionary within the context of bilingual Sepedi dictionaries, emphasizing historical and cultural aspects rather than a contemporary comparison. It explores macro- and microstructures of the dictionary, assesses accessibility to modern users, and proposes digitization strategies for improved usability, envisioning a multiphase approach with varied electronic features.
Macro- and microstructures of Endemann (1911)
Kosch (2011) criticizes Endemann’s Sotho language dictionary for disregarding some good lexicographic principles in the compilation of the dictionary. In our reassessment within the context of other bilingual Sepedi dictionaries, we focus on key issues: treatment of grammatical formatives, alphabetical categories, high-frequency lemmas, semantically related paradigms, and lemmas with cultural significance. By comparing Endemann’s dictionary with selected Sepedi reference works that span almost five decades (1967 to 2015), a balanced perspective on its lexicographic value and utility is established.
• Treatment of grammatical formatives
Grammatical formatives are notoriously undertreated in bilingual dictionaries in which the source language is a Bantu language, since these formatives are typically not carriers of lexical meaning. Our investigation shows that Endemann’s treatment of these formatives matches and, in some instances, even surpasses their treatment in modern Sepedi dictionaries.
• Alphabetical categories
Data collection carried out from 1861 to 1873 was not meant expressly for lexicography. Still, it will be demonstrated that Endemann avoided common pitfalls, such as the overtreatment of alphabetical categories, that challenge even contemporary lexicographers.
• High-frequency lemmas
Inclusion of lemmas based on their frequency of use is a feature of modern corpus-based lexicography. Even so, our experiments show that, generally speaking, Endemann’s dictionary compares very well with existing Sepedi dictionaries with regard to the lemmatization of high-frequency items.
• Semantically related paradigms
Lexical sets, as defined by Atkins and Rundell (2008), are groups of words sharing a common element of meaning, often rooted in sense relations like synonymy or hyponymy. Using the days of the week as a prototypical example, the study investigates Endemann’s awareness of completing semantically related paradigms. Surprisingly, all selected Sepedi dictionaries lemmatize weekdays, except for Endemann (1911), raising questions about conceptual differences in the Sepedi-speaking community during his data collection from 1861 to 1873. The absence of the word ‘week’ and likely lack of standardized weekday names before 1930 add historical context to Endemann’s compilation challenges.
• Lemmas with cultural significance
Endemann’s lexicographic approach is marked by rich sense distinctions and detailed definitions, particularly concerning culturally-bound lemmas. The detailed treatment of such culturally significant lemmas makes Endemann’s dictionary a valuable and distinctive resource that needs to be preserved and made accessible to new generations of users.
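The comparison of high-frequency lemmas mentioned above can be illustrated with a small sketch. It assumes, purely for illustration, that such a comparison is run by taking the N most frequent lemmas of a corpus and checking what share of them appears in each dictionary's lemma list; the abstract does not specify the actual experimental set-up, and the function names and toy data below are hypothetical placeholders, not Sepedi data.

```python
from collections import Counter
from typing import Iterable, List, Set


def top_n_lemmas(lemmatised_tokens: Iterable[str], n: int) -> List[str]:
    """Return the n most frequent lemmas in a lemmatised corpus."""
    counts = Counter(lemmatised_tokens)
    return [lemma for lemma, _ in counts.most_common(n)]


def coverage(frequent_lemmas: List[str], dictionary_lemmas: Set[str]) -> float:
    """Share of the frequent lemmas that the dictionary lemmatises."""
    if not frequent_lemmas:
        return 0.0
    hits = sum(1 for lemma in frequent_lemmas if lemma in dictionary_lemmas)
    return hits / len(frequent_lemmas)


# Toy corpus with placeholder lemmas "a".."e" (not real language data).
corpus = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
frequent = top_n_lemmas(corpus, 4)            # the four most frequent lemmas
dictionary_x = {"a", "b", "d", "z"}           # hypothetical lemma list

print(coverage(frequent, dictionary_x))       # 0.75
```

Running the same function over the lemma lists of several dictionaries would give directly comparable coverage figures for each of them.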
Accessibility to modern users
Despite its favourable comparison with existing Sepedi dictionaries, Endemann’s dictionary is hard for users to access for several reasons. The dictionary has been out of print in hard copy for over a century, while the e-version is prohibitively expensive. The publisher, De Gruyter Mouton (Verlag), has however granted us permission to digitize and publish a good part of the dictionary for free online use. The main challenge lies in Endemann’s orthography and the unconventional ordering of alphabetical categories, which are explained only in German in the introduction and demand a deep understanding of phonetics for effective use.
Digitization strategies for enhanced usability
In the last section of our paper, we investigate digitization of the dictionary on various levels of complexity and sophistication, and also indicate which resources are necessary for these different levels of digitization. Digitization will include the creation of a manual gold standard and in parallel, deployment of OCR4all (Reul et al., 2019). Initial results of the OCR process were surprisingly good, especially when considering the numerous diacritical signs found in the dictionary: an accuracy figure of 99.93% was obtained, as calculated by OCR4all. The basis for this calculation is the comparison of 10 manually transliterated pages with the predictions of the OCR model for these pages. The accuracy improved iteratively with each training. The overall aim is to determine to what extent the challenges outlined in the preceding sections can be addressed by means of selective digitization strategies in order to make the dictionary accessible to modern day users.
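As an illustration of how an accuracy figure of this kind can be derived from manually transliterated pages and OCR predictions, the sketch below computes a character-level accuracy as one minus the edit distance divided by the length of the reference text. This is a common definition, assumed here for illustration; the exact calculation performed by OCR4all may differ, and the function names are hypothetical. Because the dictionary is rich in diacritics, both strings are normalised to Unicode NFC first so that precomposed and combining forms of the same character are not counted as errors.

```python
import unicodedata
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(pages: Iterable[Tuple[str, str]]) -> float:
    """Accuracy over (gold transliteration, OCR prediction) page pairs."""
    errors = 0
    total = 0
    for gold, predicted in pages:
        gold = unicodedata.normalize("NFC", gold)
        predicted = unicodedata.normalize("NFC", predicted)
        errors += levenshtein(gold, predicted)
        total += len(gold)
    if total == 0:
        raise ValueError("reference text is empty")
    return 1 - errors / total


# One wrong character out of four in a toy "page".
print(char_accuracy([("abcd", "abed")]))  # 0.75
```

Under this definition, an accuracy of 99.93% corresponds to roughly seven character errors per 10,000 reference characters.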
Spoken language is the prerequisite of written standard languages for living language communities. Yet written sources dominate lexicographic description of standard languages, and awareness of dictionaries that specifically source speech seems limited. In Norsk Ordbok (The Norwegian Dictionary), and in the Language Collections on which the dictionary is based, oral materials are perceived as input to the national language Nynorsk, written or spoken. The purpose is integration into one whole, not a series of parallel lexical registers. Legitimacy is aimed at by explicit sourcing of linguistic information, whether from speech or literature. This paper looks at how speech is sourced within the entries of the dictionary Norsk Ordbok (NO), particularly at the sourcing of definitions. Explicit sourcing of speech in connection with definitions facilitates investigating the contribution of speech materials to Norsk Ordbok as a whole, and if and how the differences between speech and written text are reflected in Norsk Ordbok.
Czech Dictionary Express has been introduced as a project of a semiautomatically made dictionary of the Czech language. The Dictionary Express method (formerly known as rapid dictionaries) has been used for several different languages. In this paper, we analyse the automatic and manual tools used in Czech Dictionary Express and inspect the statistical and qualitative data such tools provide. As the first task of the project – the headword annotation – comes to an end, we examine some opportunities and difficulties of the method used, as well as the data acquired in the process.
The objective of this paper is to illustrate, through the examination of sample entries, the methodology employed in the creation of a prospective pilot corpusbased dictionary of Serbian as a second language, drawing on advancements applied in other similar projects for different languages (e.g., François et al., 2014; François et al., 2016; Klemen et al., 2023). While Serbian is spoken as the official language across the entire territory of the Republic of Serbia, in specific municipalities with a substantial population of national minorities, languages other than Serbian are officially recognized and spoken as native languages. In municipalities where the majority of the population belongs to a national minority, the educational system is conducted in the language of that minority. In such cases, Serbian is taught as a second language, with a curriculum comprising 90 minutes of classes per week. These language classes follow two distinct formats, contingent upon the possibility of interaction that young members of national minorities have with native Serbian speakers. Programs A and B are designed to take these differences into account (Krajišnik & Strižak, 2018). Program A is tailored for members of national minorities residing in homogeneous environments, where students lack direct contact with the Serbian language. In these cases, Serbian is treated and taught as a foreign language. Conversely, program B is aimed at members of national minorities living in heterogeneous environments, where they are consistently exposed to native Serbian speakers and possess an intermediate or high level of competency in Serbian even at the elementary school level. Additionally, there are guidelines to distinguish program C from program B (Redli, 2023). Program C would be intended for members of Croatian and Bosniak minorities with near-native proficiency in Serbian. 
The dictionary described here is intended for young students learning Serbian as a second language, specifically in accordance with programs A and B, with the exclusion of the final category of speakers in program B (designated as proposed program C). The outlined methodology comprises three key phases. The first is the compilation of a receptive electronic corpus of Serbian as a second language (SrbL2Cor 1.0), derived from 24 textbooks used in elementary schools and issued by two publishers. This corpus is stored in the ParCoLab database (Miletic et al., 2017) in XML format, adhering to the TEI P5 Guidelines, but access is restricted due to ongoing copyright negotiations. Additionally, the corpus is lemmatised, morpho-syntactically annotated, and syntactically parsed using Serbian language resources developed within the ParCoLab project. The second is the selection of vocabulary lists for lexicographic processing, which entails both automatic extraction from the SrbL2Cor 1.0 Corpus and the manual revision of the extracted lists compared to official non-corpus-based lists recommended for Serbian as a second language program creation (Krajišnik & Dognar, 2018; 2019). The third phase involves establishing a pilot XML dictionary database that will incorporate 500 lexicographically processed lexical items from the specified vocabulary lists. The lexicographic processing includes, besides the lemma entry, inflection data, usage labels, senses with their indicators, native language equivalents for intended dictionary users, and typical syntactic behavior demonstrated through slightly modified corpus examples. Upon project completion, part of this database will be accessible for free consultation in the updated multilingual dictionary module of the Serbian verb conjugator, SerboVerb (Marjanović, 2023), developed in conjunction with the ParCoLab project.
The paper compares a novel corpus-based processing method, with the article structure informed by pedagogical considerations (e.g., Jelaska, 2005; Krajišnik, 2011), against the approaches found in the few existing yet outdated bilingual and monolingual paper dictionaries for Serbian as a second language (cf. Jerković & Perinac, 1980; Vasić & Jocić, 1988–1989; Ajdžanović et al., 2016), highlighting its advantages. The innovation in the lexicographic processing of Serbian in this dictionary project involves manually annotating lemmata, their senses and examples with CEFR labels, relying on corpus data rather than intuition. This allows the extraction of specific corpus-based dictionaries tailored to corresponding CEFR levels and the primary target group. In Program A, the distinction is drawn between levels A1 and A2, guided by the frequency of occurrence in the SrbL2Cor 1.0 and lexical relevance for a specific age group of students, as stipulated by official, not corpus-based vocabulary lists (Krajišnik & Dognar, 2018; 2019). Simultaneously, in Program B, levels A2, B1, and B2 are differentiated using the same criteria.
The SemETAP semantic model is a part of a more general ETAP linguistic processor aiming at analyzing and generating NL texts. The task of SemETAP is building two kinds of semantic structures – Basic SemS, which capture the core meaning of the sentence, and Enhanced SemS, which contain diverse inferences drawn from Basic SemS. SemETAP is supported by two main lexical resources – a combinatory dictionary of Russian and an ontology. One important requirement for the SemS is that it should explicitly represent all semantic arguments of the predicates of the sentence, expressed by all kinds of words – verbs, nouns, adjectives or adverbs. We discuss the argument structure of ordinal adjectives (first, second,…, last, next), which has been largely neglected in the literature on valency and arguments. Several semantic slots are introduced for ordinal adjectives: hasObject, hasObject2, belongsTo, hasNumber, hasStartingPoint, hasTerminalPoint, orderedBy. Our analysis reveals interesting features in the behavior of the arguments of ordinal adjectives.
The paper outlines one of the results of a project dedicated to one of the endangered Kartvelian languages, Megrelian. Based on data collection and documentation through fieldwork carried out in Samegrelo (Georgia), the project aims to comprehensively document the Megrelian language and encompasses the development of an annotated corpus, a sketch grammar, and a bilingual dictionary. As a result, a bilingual Megrelian-English dictionary has been compiled using the Fieldwork Language Explorer (FLeX) and combining technological and traditional lexicographic approaches. We provide numerical examples to highlight the language structure and its application to the compilation of the dictionary, discussing its application to language preservation issues. The paper is subdivided into four parts: 1. Introduction, which outlines the project dedicated to the documentation of the Megrelian language, financed by the Shota Rustaveli National Science Foundation (FR-21-993-3, 2021–2025); 2. Lexicographic insights on the Megrelian-English dictionary, which highlights the challenges of preserving the endangered Megrelian language; 3. Macro- and micro-structures of the Megrelian-English dictionary, which emphasizes the structure of the dictionary compiled using FLeX and provides information on its licensing and accessibility; 4. Conclusions, which underscore the importance of this lexicographic effort and its application to the preservation of the Megrelian language.
We intend to create an online dictionary for manufacturing technology in the automotive industry, which will be available in German and Chinese. The dictionary is designed to improve specialized communication, and its target users include Chinese engineers, students, and interpreters in this sector. The development of entry structures using Frame-Based Terminology (FBT) as a dynamic process-oriented approach will be the main topic of this paper. Specifically, we will present a corpus-based methodology for integrating contextual information, especially syntactic-semantic features, into the entries. At this point, we start with the description of predicative terms. Such a semantics-based approach might serve as a practical solution for future bilingual dictionaries in various engineering disciplines.
The impact of artificial intelligence on language learning tools and specifically dictionaries has seen a significant shift with the advent of generative AI and chatbot technologies (De Schryver, 2023; Lew, 2023; Łodzikowski et al., 2024; Rees & Lew, 2024). We report on a study comparing the use of a mobile dictionary (Longman Dictionary of Contemporary English) and ChatGPT—an innovative conversational agent—in assisting advanced English learners in lexical tasks involving both reception and production. To assess the effectiveness of these tools, participants engaged in a series of paper-based lexical exercises designed to evaluate how each resource supports advanced students of English in solving lexical problems. For the study, a cohort of 223 advanced college-level learners of English were divided into two groups: one utilizing ChatGPT-3.5 (freely available version) and the other using a mobile dictionary (Longman Dictionary of Contemporary English) to complete the same set of tasks. Performance was measured in terms of accuracy and time-on-task. In the reception task, which involved understanding, interpreting, and translating English words into Polish, participants using ChatGPT demonstrated a significantly higher (assuming an alpha level of α = 0.05) rate of correct answers compared to those using the mobile dictionary (Odds Ratio 3.46 [CI95% 2.48–4.83], p < 0.001). Similarly, in the production tasks that required active completion of partial translations, the ChatGPT group outperformed the dictionary group (Odds Ratio 5.38 [CI95% 3.54–8.16], p < 0.001). Time-on-task was also recorded, with results showing ChatGPT to be a more efficient tool in terms of the time needed for participants to answer questions in the production task, but not in reception. 
This suggests that the instantaneity and interactive dialogue offered by ChatGPT enable learners to process and apply lexical items more swiftly than with conventional dictionary searches, thereby streamlining the consultation process. Apparently, LDOCE users were not always able to locate the relevant information even though it was present in the dictionary (we made sure that this was the case). The analysis extends beyond performance metrics to explore practical advantages and potential drawbacks of using ChatGPT in an educational setting. Among the benefits, the research highlights ChatGPT’s ability to provide immediate, context-relevant assistance and to engage learners through interactive and adaptive dialogue. This can potentially lead to enhanced retention and application of new language forms. Conversely, potential limitations include the risk of overreliance on AI for language input, reduced memorization effort, and the challenge of ensuring accurate and pedagogically appropriate responses from the chatbot. The superior performance in both reception and production tasks underscores the importance of considering AI-driven resources as serious competitors to traditional dictionaries in language-learning contexts. However, this study has not examined bilingual dictionaries, which, as earlier studies with rigorous design show (e.g., Lew, 2004), may well be better suited to language learners at all levels of proficiency. Indeed, early results from a follow-up study in progress suggest that a bilingual dictionary may yield success rates at least as good as those of ChatGPT, at least in reception. We stress the need for a balanced approach, incorporating AI while maintaining essential strategies for independent language learning and critical thinking.
The study advocates for further exploration into the optimized integration of AI-based tools into the language learning curriculum, ensuring they complement rather than replace learner autonomy and traditional pedagogical resources such as dictionaries.
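The odds ratios and 95% confidence intervals reported above can be illustrated with a minimal sketch. It assumes a simple 2×2 table of correct and incorrect answers per tool and a Wald interval on the log odds ratio; the abstract does not state which model produced its estimates, so this is not necessarily the authors' method, and the counts below are invented purely for illustration.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    a, b: correct / incorrect answers with tool 1 (e.g. the chatbot)
    c, d: correct / incorrect answers with tool 2 (e.g. the dictionary)
    z:    normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Hypothetical counts, NOT taken from the study
or_value, ci_low, ci_high = odds_ratio_wald(80, 20, 50, 50)
print(f"OR = {or_value:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

An interval that excludes 1 corresponds to a difference that is significant at the chosen alpha level. With repeated measurements per participant and item, a mixed-effects logistic regression would usually be more appropriate than this simple table.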
The paper introduces a web portal of integrated dictionaries for Bulgarian. The mapping among the resources is lemma-based. Two dictionaries are at the centre of this integration: an inflectional dictionary of Bulgarian, since Bulgarian is a morphologically rich language, and a wordnet of Bulgarian, BTB-Wordnet, since it adds the level of lexical meanings to the dictionary-enhanced knowledge. Various other types of dictionaries are also being gradually added: dictionaries with diachronic spellings, bilingual dictionaries, specialised dictionaries, etc.
This study aims to deliver an analysis of the two most popular (Liu, Deng, & Yang, 2020) Chinese-English online dictionaries: the New Chinese-English Dictionary in the Youdao Dictionary app and the Online Collins Chinese-English Dictionary. There has been a long-term demand for high-quality language learning resources from the growing population of Chinese learners of English. Moreover, the development of a Chinese-English bilingual dictionary often faces a formidable challenge owing to disparities in the linguistic structures of these two languages (Zgusta, 1971; Chen, 2004; Hartmann, 2007; Shao, 2019). The primary objective of this study is to systematically identify and categorize the shortcomings of the dictionaries, laying the groundwork for future improvements. The scarcity of reliable online dictionaries poses an impediment to effective language learning, necessitating a comprehensive examination of existing tools to identify the weak points and opportunities for improvement. Online dictionaries are considered one of the most convenient tools for language learners in today’s technological environment. Although de Schryver (2003) claimed that paper dictionaries are “unbeatable” in certain aspects, it is still generally assumed that online dictionaries, by offering new features while retaining characteristics of traditional dictionaries, can greatly aid users seeking help with word meaning and use, as has been shown in many previous studies (Dziemianko, 2010; Chen, 2012; Zhang & Pérez-Paredes, 2021; Zhang, Xu, & Zhang, 2021). However, other researchers have reported that digital dictionaries’ effectiveness in language learning is still uncertain (Lew, 2014; Ferrett & Dollinger, 2021; Gilquin & Laporte, 2021; Chen & Liu, 2022) due to the insufficient dictionary-using skills of the users and the limitations of the dictionaries themselves.
While the use of dictionaries has rapidly shifted towards digital media, there remains a gap in research on learners’ dictionary needs and preferences in online formats and on the strengths and deficiencies of current online dictionary products. The methodology of this study involves an analysis of 30 entries in the two resources mentioned above. These 30 entries, comprising 15 verbs and 15 nouns, have been randomly chosen from various language learning resources that are representative of the vocabulary that Chinese learners are likely to encounter and use: five come from the top 200 frequency list in the zhTenTen corpus, and the rest are drawn, two each, from the word-list reference books of the five major English tests in China: IELTS, CET6 (College English Test Level 6), TEM8 (Test for English Majors Level 8), CATTI (China Accreditation Test for Translators and Interpreters), and the Postgraduate Admission Test. Through this analysis, various issues in both content and structure are identified and categorized. Categories include incorrect definitions; inadequate explanations; outdated information or obsolete terms; limited synonyms and antonyms; missing or mismatched contextual examples; missing pronunciation, grammatical, or special usage guidance; and inconsistent, unclear, or disorganized structure. These findings present a broad view of the challenges that currently exist in online Chinese-English dictionaries. To augment this analysis, a targeted questionnaire was administered to intermediate-to-advanced Chinese learners of English, specifically those with B2–C1 proficiency levels, in order to elicit insights into the learners’ usage habits and unmet needs concerning online bilingual dictionaries. The questionnaire was designed to cover various aspects, including frequency of use, perceived reliability, and satisfaction with specific features such as example sentences, pronunciation guides, and contextual usage.
By integrating user perspectives into the evaluation process, this study seeks to unveil additional challenges that may not have been discerned through the initial entry-based analysis. Based on the findings from both the entry analysis and user feedback, several recommendations for improving Chinese-English online bilingual dictionaries are offered: provide more accurate and detailed definitions with clear distinctions between different senses of polysemous words; enrich explanations with comprehensive information to aid learners in understanding the full scope of a word’s meaning and usage; regularly update dictionaries to remove obsolete terms and incorporate contemporary language usage; include a wide range of contextual examples, supported by audio pronunciations and grammatical guidance; improve the structure and organization of entries to enhance user experience; and incorporate more interactive features. In conclusion, this study highlights the critical shortcomings of current Chinese-English online bilingual dictionaries and offers insights for future improvements. Addressing these issues can help create more effective and user-friendly resources, better supporting Chinese learners of English in their language learning journey.
Introduction
Technical languages contain expressions that are not universally understood. We define non-lexical entities (NLEs) as single- or multi-word expressions not listed in domain dictionaries. These are especially difficult to differentiate from lexical entities when domain dictionaries are small or incomplete, which is often the case for low-resource languages. The medical domain further complicates this issue through specialized written language, jargon expressions, the use of multiple languages, and various entities that do not exist in domain lexica. This work will focus on four NLE categories in the Croatian medical domain: (i) short-forms, i.e., abbreviations or acronyms, (ii) deviations from standard spelling, i.e., lexical variants, misspellings, and mistypings, (iii) brand names, and (iv) proper names. The choice of categories was influenced by related work in various languages underlining the challenges of short-form ambiguity (Schwarz et al., 2021) and information retrieval tasks (Gendrin et al., 2023; Raja, 2022). Large language models (LLMs) offer an opportunity to use the contextual understanding present in the trained models to identify NLEs and annotate these portions of text without following a named entity recognition (NER) approach, instead prompting for the solution with tools such as ChatGPT, an LLM for text generation and text synthesis.
Dataset
The dataset consists of health forum entries in Croatian, crawled from online websites and annotated by a domain expert. These texts blend typical lay language with pasted fragments of clinical documents, posted with the aim of receiving physician-authored advice or answers. NLEs make processing texts harder, as most smaller language models are trained on text data that use standardized language from domain dictionaries. Furthermore, NLEs introduce ambiguity into texts, and the context is needed to distinguish between dictionary content and NLEs: e.g., OPIS (‘description’) in uppercase letters appearing in a text could be either a section header or an abbreviation, and only the context can resolve which. The dataset is split into 80% training, 10% validation, and 10% test sets. The four mentioned groups were annotated and exported in a BIO (beginning-inside-outside) labeling format for sequence modeling. Expressed uncertainty among forum users further served as motivation for the choice of NLE categories.
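To make the BIO export concrete, the following minimal sketch converts token-level span annotations into BIO tags. The label names (BRAND, SHORT) and the example sentence are hypothetical stand-ins for the four NLE categories; the actual tag set and tokenization of the dataset are not given in this abstract.

```python
def to_bio(tokens, spans):
    """Convert annotated spans into one BIO tag per token.

    tokens: list of token strings
    spans:  list of (start, end, label) with `end` exclusive,
            expressed in token indices
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

# Invented example: a two-token brand name and a one-token short-form
tokens = ["pijem", "Aspirin", "Protect", "zbog", "RR"]
spans = [(1, 3, "BRAND"), (4, 5, "SHORT")]
bio_tags = to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

The B- prefix marks where an entity begins, which keeps two adjacent entities of the same category distinguishable in the tag sequence.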
Methodology
For a comparative baseline without LLMs, a NER approach with fine-tuned BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020) models was performed. To accomplish the automatic identification and annotation of NLEs, ChatGPT was prompted for each run-through with the prompt seen in Figure 1. Fine-tuning, in the context of LLMs, refers to the injection of training examples as prompt answers to create a specialized downstream version of the given LLM, which is then applied for prompting. First, the model ‘gpt-3.5-turbo’ was applied without fine-tuning. The training dataset was utilized for manual prompt engineering. Second, the model ‘gpt-3.5-turbo-1106’ was fine-tuned twice with differently sized subsets of the training dataset. The investigation focused on the impact fine-tuning has on performance while limiting the number of training samples for cost reasons.
All methods were evaluated with precision, recall and F1-measure for exact prediction matches.
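The exact-match evaluation can be sketched as follows. This is an illustrative implementation assuming BIO-tagged sequences, in which a prediction counts as correct only if its start, end, and label all match a gold entity; the function names and example tags are hypothetical, and the authors' evaluation script may differ in detail.

```python
def bio_to_spans(tags):
    """Collect (start, end, label) entity spans from a BIO tag sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def exact_match_prf(gold_tags, pred_tags):
    """Precision, recall and F1 over exactly matching entity spans."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented example: the brand name is predicted with a wrong boundary
gold = ["O", "B-BRAND", "I-BRAND", "O", "B-SHORT"]
pred = ["O", "B-BRAND", "O", "O", "B-SHORT"]
precision, recall, f1 = exact_match_prf(gold, pred)
print(precision, recall, f1)
```

Under exact matching, a partially overlapping prediction counts as both a false positive and a false negative, which makes the metric stricter than token-level scoring.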
Results
The baseline approaches reached an exact prediction of NLEs with an F1-measure of 0.88 (BERT) and 0.91 (ELECTRA). Prompting for the annotation of Croatian health forum entries without fine-tuning resulted in poor-quality results, i.e., an F1-measure of 0.48, where most NLEs were not identified. Through fine-tuning, the performance increased to an F1-measure of 0.82, even with only 100 sentences, approx. 98,000 tokens. In the final fine-tuning step, with 1,000 sentences of the training set, less than one third of the full training dataset, results similar in F1-measure to those of the baseline multilingual BERT and specialized ELECTRA models were reached (see Table 2), while surpassing the baselines in precision.
Discussion
The best model was able to outperform both baselines in precision and to outperform the multilingual BERT baseline in F1-measure, but still fell slightly short of the specialized ELECTRA model. This suggests that LLMs can be fine-tuned with only a third of the data used by the baseline methods to reach state-of-the-art results. The largest portion of errors with the model that was not fine-tuned stemmed from insufficient contextual understanding of the prompt, leading to misclassification of NLEs as lexical entities. Fine-tuning significantly reduced these errors, particularly in differentiating lexical variants. A detailed error analysis revealed that misspellings, mistypings, and diacritic variations posed challenges, possibly because the pre-training data of LLMs are primarily in English. Lexical variants emerged as the most error-prone NLE category, followed by short-form content, while results for brand names and proper names surpassed those of the baseline models. Given the complexity of medical terminology and limited resources for Croatian medical texts, this approach is relevant for the identification, classification, and inclusion of NLEs into domain dictionaries, as well as for automating language resource creation.
Conclusion and Outlook
Our findings reveal ChatGPT’s potential for automating labeling in the Croatian medical domain, reaching results similar to those of state-of-the-art approaches while needing less data. Automated annotations can enhance datasets for low-resource languages, speed up the creation and expansion of annotated datasets for dictionaries, and reduce human annotation hours. Future work will involve extending this workflow, optimizing fine-tuning, and employing natural language processing techniques to further process the identified NLEs.
Mastering idiomatic language in its broadest sense is necessary to achieve advanced levels in language learning. Therefore, phraseological information should be quickly and easily available to language learners. To this end, the Dutch project Woordcombinaties (Word Combinations) is developing an integrated lexicographic resource combining a collocation and idiom dictionary with a pattern dictionary. It merges a word-in-context and collocation tool, following the example of SKELL (Sketch Engine for Language Learning), and a pattern dictionary, following examples such as the Pattern Dictionary of English Verbs (PDEV) and T-PAS for Italian (Ježek et al., 2014). Woordcombinaties is based on a corpus of about 200 million tokens of primarily newspaper material from both the Netherlands and Belgium to reflect language variety in Dutch. The corpus is syntactically parsed with the Alpino parser (van Noord 2006) and uploaded into Sketch Engine. Pattern editing (using Corpus Pattern Analysis (CPA) (Hanks 2013)) is supported by SKEMA (Sketch Engine Manual Annotation), a specially developed corpus pattern editor system (Baisa et al., 2020). Pattern editing is still very much a computer-supported manual task and as such extremely labour-intensive. In this poster we present some experiments exploring ChatGPT’s performance in the context of pattern editing, i.e., pattern generation, pattern classification, and semantic type annotation. For all tasks we have used few-shot prompting, providing ChatGPT-4 with at least two examples as well as clear instructions on the expected output (see Figure 1 for an example prompt). The results were evaluated by comparing them to the manually annotated data.
Figure 1. Example prompt:
You are a lexicographer working on Corpus Pattern Analysis as developed by Patrick Hanks. You know that the Dutch verb ‘boeken’ has 4 patterns defined by these implicatures:
1. iemand schrijft iets in de boekhouding
[...]
You have to disambiguate a list of Dutch sentences according to these implicatures, with the output in a json object with a single property “concordances” which is an array containing for each sentence: the sentence and an implicature number. For each sentence, you return the following information:
- sentence: the sentence itself
- implicature: the number of the implicature
according to the following json format for each sentence: {“text”: <sentence>, “implicature”: <implicature_number>}.
Here is an example for the disambiguated output:
{ “concordances”:[ [ {“text”: “Het bedrijf boekte een omzet van 53,78 miljard euro” , “implicature”: “1”}, {“text”: “Ze boeken vlucht en hotel apart, zoeken zelf wel een Airbnb” , “implicature”: “3”}, …]
Now process the following sentences:
We hoeven dus geen hotel meer te boeken als we naar zee willen.
Met de nettowinst van 67 miljoen euro boekt Adecco 40 procent minder winst dan 2012.
Initial results suggest that, of the three tasks that we explored, pattern generation is the most difficult. ChatGPT tends to copy from the example patterns provided in the input in the patterns it generates. ChatGPT also struggles with semantic type annotation, whereas the results for pattern classification are rather promising. The ongoing study using ChatGPT will provide comprehensive insights into its performance in terms of CPA-like analysis.
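The comparison of the model output with the manual annotation can be sketched in a few lines. This is a hypothetical illustration, assuming the reply is a well-formed JSON object with a flat “concordances” array as requested in the prompt; the reply text and the gold labels below are invented and are not results from the project.

```python
import json

def classification_accuracy(reply_text, gold_labels):
    """Share of gold-annotated sentences that received the same implicature number.

    reply_text:  JSON string returned by the model
    gold_labels: dict mapping each sentence to its manually assigned number
    """
    data = json.loads(reply_text)
    predicted = {
        item["text"]: str(item["implicature"])   # tolerate numbers or strings
        for item in data["concordances"]
    }
    hits = sum(
        1 for sentence, label in gold_labels.items()
        if predicted.get(sentence) == label       # a missing sentence counts as an error
    )
    return hits / len(gold_labels)

# Invented reply and invented gold labels, for illustration only
reply = json.dumps({"concordances": [
    {"text": "We hoeven dus geen hotel meer te boeken als we naar zee willen.",
     "implicature": "3"},
    {"text": "Met de nettowinst van 67 miljoen euro boekt Adecco 40 procent minder winst dan 2012.",
     "implicature": "2"},
]}, ensure_ascii=False)
gold = {
    "We hoeven dus geen hotel meer te boeken als we naar zee willen.": "3",
    "Met de nettowinst van 67 miljoen euro boekt Adecco 40 procent minder winst dan 2012.": "1",
}
accuracy = classification_accuracy(reply, gold)
print(accuracy)
```

In practice the reply would also need validation, since a model may return malformed JSON or alter the sentence text, in which case the lookup by sentence fails.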
The purpose of the conference paper (poster) is to present an electronic contextual repository consisting of DA – dictionary agnonyms (lexical innovations, but also unique, extremely rare, previously unrecorded words – cf. Bartwicka et al., 2007; Fedorushkov, 2009) extracted from a corpus of Russian regional press texts from the turn of the millennium (1996–2006), with a special focus on neonymic lexis (neologisms and occasionalisms). The repository does not contain dialectological data and metadata, but each word is assigned information about the location of its appearance in usage. The poster includes information about the data corpus, the research and technical methods, and the structure of the repository. 10,000 DA were retrieved from the press corpus (Fedorushkov, 2023), which consists of archives of more than two hundred regional Russian newspapers from the indicated period. The methods of retrieving words from the corpus are both traditional (manual) and automated. All words of the press corpus (about 250 million) were checked for their occurrence in previously selected dictionaries of the Russian language (e.g., Ozhegov & Shvedova, 1992, as well as Ushakov, MAS, BAS and others). The tools for obtaining words not included in the Zaliznyak 1987 Russian dictionary are a tagger and a parser based on AOT technology (a morphological analyzer with a database of words and generated word forms derived from the Zaliznyak 1987 dictionary). Excerption filters for DA are written in REGEX syntax, along with coding for grammeme clusters that makes it possible to select individual parts of speech. The obtained list was verified against a further list of words from other Russian language dictionaries, including (manually) dictionaries of neologisms (e.g., by Kotelova, Milekovskaya, Soloviev, Butseva, Levashev and others). The total list of verification dictionaries comprises about forty titles. Also described are the algorithms and analytical methods that were used to extract, process and organize DA.
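The core idea of dictionary-based excerption can be illustrated with a deliberately simplified sketch: tokenize the text with a regular expression and keep the word forms absent from a reference list. This is an assumption-laden toy version; the pipeline described above relies on AOT morphological analysis and grammeme coding, which are not reproduced here, and the sample sentence and known-form list are invented.

```python
import re

# Letter sequences, optionally joined by hyphens (e.g. "онлайн-болтовня")
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def candidate_agnonyms(text, known_forms):
    """Return word forms in `text` that are absent from `known_forms`.

    known_forms: set of lower-cased word forms from the reference
                 dictionaries (hypothetical stand-in for the real lists)
    """
    seen, candidates = set(), []
    for match in WORD.finditer(text):
        form = match.group(0).lower()
        if form not in known_forms and form not in seen:
            seen.add(form)                  # report each form only once
            candidates.append(form)
    return candidates

# Invented sample sentence and invented reference list
sample = "Новая фэшн-неделя и онлайн-болтовня в городе"
known = {"новая", "и", "в", "городе"}
found = candidate_agnonyms(sample, known)
print(found)
```

Matching surface forms against a list only works if the list contains inflected forms, which is why the described approach generates word forms from the dictionary with a morphological analyzer.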
Selected DA are unique words not registered in dictionaries before 2006 (the year of the verification activities), so there are potential neologisms among them. The excerpting work also ended in the indicated period (cf. Fedorushkov, 2008). Due to the enormity of the technical and substantive work, later lexicographic sources were not taken into account. The selection of contexts for DA from the corpus lasted for nearly 16 years and followed the principles of linguo-chronologization described in the monograph by Wierzchoń (2008). One of these principles is the selection of the earliest context of DA occurrence. The selection of contexts was done with the dtSearch indexer, which allows an alphabet to be defined for segregating words from electronic texts. The vocabulary in the Repository is placed in two correlated lists using a network of hyperlinks in HTML technology. Both indexes are arranged alphabetically: the first index is ordered “from the beginning of the word” (a fronte), while the other is in inverted order, alphabetically “from the end of the word” (a tergo). In this way, each DA can be viewed in two lists (indexes). Additional information relates to the use of the Repository, with special attention to the context of DA occurrence, accompanied by a map of the region and city, the date of registration, and the name of the regional newspaper in which the DA was registered. Each DA is thus provided with its context of use and with geochronological information. The territorial range of DA occurrence covers practically the entire territory of the Russian Federation before 2007; the locations will be presented in the poster in the form of an infographic. The poster also shows infographics on the growth of agnonymic lexis by region and year.
The DA obtained from the selected period are largely composites (Misturska-Bojanowska, 2013) of the type SMS-ругательство, смс, онлайн-болтовня, фэшн-неделя, FM-станция, аудиовидеосинхронизация, профайлер, путиномания, путиномика, брендмейкер, порноюмор, IT-технология, sms-голосование, лонг-дринк, интернет-переписка, смарт-фон, смарттелефон, with a tendency toward affixation, i.e., with the presence of a formant such as an affix, affixoid, semi-affix, radixoid, or radix (cf. classifications in Bartkov & Minina, 2019). The poster also includes information regarding research paths in the analysis of neonymic lexis. With the help of the Repository, provided as an integral part of the poster presentation, it is possible to observe precisely the growth of particular types of word-forming tendencies in the development of language. The Repository is aimed at a wide range of lexicographers.
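The two index arrangements of the Repository, a fronte and a tergo, can be illustrated with a short sketch. It assumes plain Unicode code point order, which approximates but does not exactly reproduce Russian alphabetical order (for instance, ё sorts after я), and it uses four of the words listed above as sample data; how the Repository itself builds its indexes is not specified here.

```python
def a_fronte(words):
    """Ordinary alphabetical index: compare words from their beginning."""
    return sorted(words, key=str.lower)

def a_tergo(words):
    """Reverse index: compare words from their end, letter by letter."""
    return sorted(words, key=lambda word: word.lower()[::-1])

sample_words = ["профайлер", "брендмейкер", "путиномания", "путиномика"]
front_index = a_fronte(sample_words)
reverse_index = a_tergo(sample_words)
print(front_index)
print(reverse_index)
```

The a tergo arrangement places words with the same ending next to each other (here the two words in -ер), which is what makes a reverse index useful for studying suffixes and final compound elements.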
The development of Croatian dialectal lexicography has been written about several times. For example, Lisac (2005; 2018) listed previous publications that could be called dialectological dictionaries or word lists; to the work from 2018, he attached valuable maps (created by Tome Marelić) with marked places and areas that have their own dialect dictionary. A list of selected Croatian dialect dictionaries is provided by Menac-Mihalić & Celinić (2012, pp. 300–302) and Samardžija (2018, pp. 163–167), and similar lists can be found elsewhere (for example, see under “Additional literature” for the course Croatian Dialect Lexicography at the Faculty of Humanities and Social Sciences, University of Zagreb: https://theta.ffzg.hr/ECTS/Predmet/Index/1719). A complete bibliography of Croatian dialect dictionaries has not yet been created, and the question is what should eventually be included in it. It is sometimes difficult to determine what we could consider a dialect dictionary and what is just a list of words; works of an explanatory nature such as the ones mentioned, and lists of selected dialect dictionaries, are certainly a good guide for determining the boundary. For a sound judgment, however, it is necessary to review and qualitatively assess all the listed works that could be considered dialectal dictionaries. Although several authors have written about the principles of creating dialect dictionaries and the design of their entries (Kapović, 2008; Tamaro, 2008; Blažeka, 2006; 2008; 2019), and the principles of creation are often included in the dialect dictionaries themselves (see, for example, Oštarić & Vranić, 2016, pp. 105–110; Kovačec, 2020, pp. 13–16), even when the author is an amateur (see Kranjčević, 2003, pp. XVII–XX), authors of dialect dictionaries in fact rarely follow good theoretical and practical examples.
Not all authors have the same ambitions when creating dialect dictionaries, and often the ambitions of amateur authors do not coincide with the needs of the dialectology profession. While some authors, mostly professionals, try to be consistent in the choice of words to include in the dictionary and in the formatting of dictionary articles according to certain criteria, amateurs are often content with compiling alphabetical lists of words that deviate in one way or another from the standard language, with minimal linguistic processing of the material. This paper will try to look at the motives behind the creation of dialect dictionaries and to predict, as far as possible, in which direction Croatian dialect lexicography will develop. We will see how productive Croatian dialect lexicography really is, who the authors of dialect dictionaries are, and whether it is possible to predict the tendencies of dialect dictionary production based on quantitative data about them and their motives. For this purpose, a database of 130 dialectal dictionaries of Croatian local dialects or groups of Croatian local dialects within the borders of the Republic of Croatia was created, with basic bibliographic data about these dictionaries (author’s name and surname, year of publication, scope of the work, etc.) and metadata that are mostly drawn from their prefaces and afterwords (e.g., author’s age, author’s profession, motives for creating the dictionary, etc.). Dictionaries published within journals were excluded, although among them there are important contributions to Croatian dialect lexicography.
This paper investigates Czech territorial dialect lexicography, particularly the incorporation of lexicalized phonological and morphological phenomena into differential dialect dictionaries. It examines the methodologies used in current Czech dialect dictionary creation. The study analyzes corpora of the Czech National Corpus (CNC), including ORAL v1, ORTOFON v2, and DIALEKT, to trace the lexicalization of dialect features. Observing lexicalization in Czech dialects through corpus analysis elucidates its various phases, from an initial frequency increase to subsequent restriction to specific lexical units. Through illustrative examples (semi-semiconsonantal u, hard ł, the nominative plural ending -í of masculine animate nouns), the study sheds light on the evolution and distribution of lexicalization in contemporary spoken Czech. Additionally, it addresses challenges in documenting regular and irregular dialectal variations and proposes that lexicalized forms, even when they do not alter the lemma, should be included in differential dictionaries. Such an approach would enhance the representation of dialectal differences, contributing to a more comprehensive understanding of Czech dialectology.
Indonesian has been designated as the 10th official language of the UNESCO General Conference. Consequently, language development, including vocabulary updates and dictionary management, is inevitable. So far, the Indonesian Comprehensive Dictionary (KBBI) has been open to loanwords from other languages, including foreign languages and local languages. This paper compares (i) East Asian loanwords in the Oxford English Dictionary (OED) and KBBI and (ii) the update management that is regularly implemented by Collins English Dictionary (CED). The OED documents 867 loanwords from East Asian languages, including Japanese, Chinese, and Korean. The loanwords are identified by their semantic categories. KBBI, on the other hand, has now documented 446 loanwords from Chinese dialects (classified as Hakka, Hokkien, Hokkien Fuzhou, Hokkien Quanzhou, Hokkien Tong’an, Hokkien Xiamen, Kanton, and Hokkien Zhangzhou), 90 Japanese loanwords, and 12 Korean loanwords. The research found that at least 26 semantic categories and about 98 topics of East Asian loanwords have been documented in the OED, while KBBI records 18 semantic categories and 38 topics of East Asian loanwords. The research provides information on the semantic categories and topics of the loanwords, which are identified by genus proximum. Genus proximum has been a main concern of many lexicography experts. Zgusta (1971) divided the structure of a definition into two parts, called genus proximum and differentia specifica. He mentioned that the genus proximum is the hypernym and the differentia specifica is the additional semantic feature. Hartmann and James (2002) shared a similar perspective, explaining that the genus proximum, in a classical definition, is the first part of the definition, of which the word being explained is considered to be a specific instance. Semantically, it is a superordinate word (hypernym) in terms of which the defined word, called the subordinate (hyponym), is defined.
Heyvaert (2016) mentioned that, to systematize the meaning description in a dictionary, a semantic core, modified by some features and supplemented by further attributive information, is mandatory. Of the 26 semantic categories (e.g., natural science, politics, crafts and trades, etc.) and 98 different topics (e.g., needlework, jewellery, ceramics, earth science, chemistry, medicine, government organization, etc.) that are documented in the OED, consumables (e.g., doenjang, makkoli, anago, daikon, moo goo gai pan, twankay, etc.) and arts (e.g., cheongsam, dong bao, hikikomoro, marumage, aegyo, hallyu, etc.) are the topics with the highest number of loanwords published so far. A similar situation applies to KBBI. Consumables (e.g., bakwan, dimsum, mirin, yakiniku, bingsu, gocujang, etc.) and culture (e.g., hanbok, oppa, sundoku, hanami, liangliong, comblang, etc.) are recorded as the largest topics among the 18 semantic categories (architecture, sport and leisure, transportation, etc.) and 38 topics (building, cemetery, place, game, ship, train, etc.) of East Asian loanwords, which makes these topics the strongest points of all classifications. This indicates that East Asia has strong power in terms of cultural diplomacy. Taking everything into account, economics, literature, and administration topics are not found in KBBI, while the OED lacks daily activity and expression topics. The focus of the research then moves to the crowdsourcing practice in Collins English Dictionary (CED).
The research points out several facts about crowdsourcing management in CED, namely that (i) a lemma proposal has to be verified through several processes, (ii) the dictionary has documented 206 word suggestions since its first launch, (iii) every suggestion has its own verification note that indicates the ongoing process, and (iv) the word suggestion form is quite simple, consisting of the suggested word, a definition, and additional information to support the suggestion. The highlights of the crowdsourcing management system of CED are best practices that can be recommended for the crowdsourcing management system of KBBI. In addition, the semantic categories and topics that are not found in KBBI can serve as inspiration for the next vocabulary update period.
Numerous linguistic studies have shown that languages differ in the categorization and segmentation of event experiences, and that they can systematically express events with more or less detail regarding their constituent elements such as path, manner, cause, ground, etc. (e.g., Plank, 1984; Talmy, 1985; Slobin, 2000; Özçalışkan & Slobin, 1999; 2003; Stathi, 2023). For example, Talmy (1985) observed that there are languages that typically conflate motion and manner in the verb and express path in a satellite (satellite-framed languages), and those that encode path in the verb and express manner in a separate linguistic unit, and thus often omit manner altogether (verb-framed languages). Drawing on Talmy’s typology, this paper investigates differences in the encoding of caused accompanied motion in two typologically different languages, Turkish and Croatian. More specifically, it examines how the directional verbal concepts BRING and TAKE are lexicalized in Turkish, as a verb-framed language, and in Croatian, as a satellite-framed language. Caused accompanied motion can be defined as a three-participant event in which the agent moves along the same trajectory as the theme object (animate or inanimate), and in which the directionality can be deictically specific or non-specific (cf. Margetts, Riesberg, & Hellwig, 2022). The aim of this paper is to identify differences and similarities between the two languages in terms of where they place boundaries between different events, what linguistic means they use in the process of event lexicalization, and what aspects or components of the event they include or omit in this process. The study focuses primarily on the analysis of two semantically general Turkish deictic verbs, getirmek ‘bring’ and götürmek ‘take to’, and their Croatian translation equivalents (verbs mostly prefixed by do- ‘to’ and od- ‘from’).
For the purpose of the analysis, we examined the contextual use of the selected verbs in Turkish and Croatian corpora, the way their meanings are presented in monolingual and bilingual dictionaries (TDK Sözlük; Hrvatski jezični portal; Püsküllüoğlu, 2005; Hrvatski enciklopedijski rječnik, 2002; Yeni Türkçe-Sırpça Sözlük, 2014; Boşnakça-Türkçe Sözlük, 2015), and the way Turkish verbs are translated in Croatian translations of several Turkish novels by Orhan Pamuk. The preliminary results point to some important differences between the two languages. Compared to Turkish, Croatian shows a higher level of granularity in the partitioning of the semantic domain of caused accompanied motion. For example, in addition to the path, which is encoded by a prefix, Croatian verbs obligatorily encode the manner of caused motion (e.g., odvoditi ‘to take something/someone by walking it/one’ vs. odnositi ‘to take something/someone by carrying it/one’), they can distinguish animate theme objects (odvoditi), and they offer a choice to express a vehicle-supported transportation (odvoziti), while Turkish verbs omit to encode all three specificities. On the other hand, Turkish shows a greater tendency to segment an event on the temporal level by using serial verb constructions (e.g., alıp getirmek ‘to take up and bring’, lit. ‘to take up bring’), while in Croatian such segmentation seems rather redundant (as observed in Croatian translations of such Turkish phrases).
In this paper a doctoral study on the description of semantic change in the Swedish Academy Dictionary (SAOB) is presented. The starting point for the study is semantic labels like figurative and in extended use. Five such labels in SAOB are examined, mainly with methods from the cognitive linguistic framework. The results show, among other things, that the most labelled mechanism is metaphor, that metaphor and metonymy often co-occur and that such co-occurrences often are expressed and explained in the dictionary with combined or modified labels. Furthermore, there seems to be a certain overlap between some of the labels in use, and one could question if all five labels are needed. Since the first volume of SAOB was completed in 2023 and a revision project now is launched, it is possible to make practical use of the results of the study in the dictionary. Therefore, some practical outcomes and applications that are either planned or already started are also presented in this paper.
This study discusses the possibilities of expanding the scope of the largest Estonian dictionary – the EKI Combined Dictionary – with various types of constructional information. Designing a representation of constructions essentially means building a constructicon. The study starts with a short overview of existing constructicons and the main challenges their creators have faced so far. We address these issues from the point of view of data model reorganisation and database restructuring. Extending the lexicographic resource with constructicographic information is twofold: the existing constructional information must be migrated into a new model and then complemented with additional constructions extracted from a corpus.
This article examines advances in phraseomatics and digital phraseography through the DiCoP project and its DiCoP-Text corpus, aimed at enriching linguistic models and machine translation. The project evaluates the frequency of use of phraseological units (PUs) and improves their translation in different contexts, drawing on recent research in phraseotranslation and natural language processing (NLP). It emphasizes French-Chinese and Chinese-French language pairs. We integrated 549 PUs from the novel The Three-Body Problem by Liu Cixin for our tests. Various processes, such as tokenization, identification, alignment, and annotation, were used to improve the translation of PUs. DiCoP-Text, a comprehensive database including newspaper articles, literary works, and textbooks, aims to enhance the performance of language models (LMs).
The application of crowdsourcing in the creation of educational resources, understood as the gathering of collective intelligence for pedagogically oriented tasks, has garnered considerable attention in recent years. Advanced internet technologies facilitate collaborative content creation between learners and educators, potentially enhancing the learning experience. Crowdsourcing has emerged as a vital aspect of online education because it promotes the openness and exchange of resources and knowledge within communities and user groups. However, there has been insufficient research on the application of crowdsourcing to the creation of user-generated educational materials. In the present study, we introduce a methodology for designing a crowdsourced resource-gathering system for the creation of custom-tailored, user-generated, content-controlled educational materials. Crowdsourcing in education refers to an online activity wherein an educator or educational institution invites a group of individuals through an open call to directly assist in learning or teaching. A major problem of educational crowdsourcing is the loss of control by the natural controller of the educational material: the approach does not necessarily produce a content-controlled version of the materials. As a result, a student who studies French as a foreign language might receive, for example, material on irregular verbs in English instead of relevant educational content on French vocabulary. Such content may be accurate but is not pertinent to the French curriculum; in other cases, contributions may even contain distorted or false information. Therefore, it is essential to incorporate an approval stage into the resource-gathering process that includes peer review and relevant feedback.
Involving students in the development of their learning materials can greatly enhance their engagement and understanding. The proposed methodology presents a scenario in which a teacher invites the students to participate in creating course materials. This approach aims to promote active learning, foster a sense of ownership among students, and enrich the classroom resources with diverse perspectives. The methodology uses a three-phase model and addresses the issue of unsuitable content by adding a content approval phase to the process. It comprises three major stages of content development: 1. definition of the topic and subtopics, 2. content creation, 3. content approval. In more detail, the procedure for submitting content to the platform involves an end-user (registered or anonymous) who supplies content according to a predefined format:
• During the first phase, users receive the topic for contribution, then suggest relevant sub-topics, and later vote for relevant sub-topics proposed by other users;
• The second phase requires users to choose a sub-topic and provide several data items;
• The third phase requires user participation in a crowd-rating process in which users must vote on the provided data items and work out a mechanism to resolve disagreements.
The approval stage plays a crucial role in this methodology, setting it apart from others. The crowd-rating process occurs twice: initially for the approval of sub-topics in phase 1, facilitating customized content creation, and later in phase 3, resulting in a controlled version of the materials. If there is a disagreement among the crowd raters, various options are available: increasing the number of raters, seeking input from domain experts by sending them either the entire item or a portion of it, using AI tools such as ChatGPT to resolve the conflict, or employing a combination of these approaches.
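The crowd-rating step and its handling of disagreement can be illustrated with a minimal sketch. This is not the project's implementation: the function name rate_item, the minimum of three raters, and the 0.7 agreement threshold are assumptions introduced here purely for illustration.

```python
def rate_item(votes, min_raters=3, agreement=0.7):
    """Decide the status of one contributed item from crowd votes.

    votes: list of booleans, True meaning "approve".
    min_raters and agreement are hypothetical parameters, not values
    taken from the methodology described in the text.
    """
    if len(votes) < min_raters:
        # Too few ratings to decide: ask more users to rate the item.
        return "need_more_raters"
    share_yes = sum(1 for v in votes if v) / len(votes)
    if share_yes >= agreement:
        return "approved"
    if share_yes <= 1 - agreement:
        return "rejected"
    # Raters disagree: hand over to one of the resolution options
    # (more raters, domain experts, AI tools, or a combination).
    return "escalate"


if __name__ == "__main__":
    print(rate_item([True, True, True]))
    print(rate_item([True, False, True, False]))
```

The same function could be applied in phase 1 (votes on proposed sub-topics) and in phase 3 (votes on data items), since both are described as crowd-rating rounds.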
The proposed methodology is currently being developed by a team of computer science specialists and will subsequently undergo testing by a group of language students who will be engaged in developing educational materials for a particular course. For example, in a high school history class, students can provide unique insights and diverse resources that enrich the learning materials. Similarly, in a university language course, learners can share and refine study aids that cater to their specific needs and interests. The type of educational materials would be specified by the teacher who assigns a specific topic to a particular student or group of students. The students follow the three steps of content development, previously discussed in the proposed methodology. We believe that this methodology along with its principles and phases, could be applied for generating educational materials across different fields and disciplines. When used in the right setting and aimed at the correct audience, crowdsourcing can offer valuable contributions and help create relevant and high-quality content.
This ongoing research to obtain a master’s degree in Linguistics at the Federal University of Ceará explores the extraction of specialized collocations and their analysis for the creation of a corpus-based bilingual glossary of legal discourse in American English and Brazilian Portuguese. The specialized collocations were extracted from a legal English corpus consisting of the subtitles of 134 episodes of the North American TV series “Suits” (CS), which was analyzed using the software Sketch Engine (Kilgarriff et al., 2014). The comparable corpus English Web 2021 (enTenTen21) was chosen to find further evidence of usage and co-occurrence. Fromm (2011) categorizes television series according to their use of terminology, from completely fictional to a portrayal of real-life communication. CS can be considered a type of specialized text (Pavel, 1993), exploring the variety of legal discourse (legalese), encompassing here its written and printed aspects and its oral elements (Hoffman, 1998 apud Finatto, 2015). Monteiro-Plantin (2014) indicates that lexical unit combinations are multiword units of relative stability, with a certain degree of idiomaticity, and conventionally used in specific situations. Specialized collocations would be the ones found in specialized discourse, such as legal terminology (Bevilacqua, 2005). Following Orenha-Ottaiano (2016, 2021) and making use of Corpus Linguistics with the help of the Sketch Engine tool, we were able to compile CS, counting 1,203,293 tokens and 960,603 words. Using different Sketch Engine tools (Keywords, Word Sketch and Concordance), we collected data to compose a list of keywords in descending order according to their frequency in CS. The list was manually analyzed to exclude grammatical words and select the relevant results. Respecting the descending order of their keyness score (the likelihood of the unit belonging to specific terminology), we selected fourteen words to serve as nodes for researching the collocations.
Candidates for specialized collocations were selected on the basis of a typicality score of at least 3 (Gablasova, Brezina, & McEnery, 2017; Orenha-Ottaiano, 2016), since a lower score would indicate that the co-occurrence of the words involved lacks sufficient stability or fixity; this yielded 41 candidates. The Concordance tool presented the nodes in context, organized by lines, centered and highlighted to facilitate the analysis of the elements that accompany them, in order to determine whether they are valid candidates for a lexical combination with compositional meaning. Those combinations that show relevant frequency throughout the research corpora were then analyzed with regard to their morphosyntactic formation. According to Sinclair (1991), a collocation is composed of a node and a collocate, the node being the search word of research interest, while the collocate is what accompanies it. Hausmann (1985) understands the collocation as being composed of a base and a collocate, the base being an independent element, semantically autonomous and understandable/translatable regardless of the collocation, while the collocate is a modifying concept, interpretable within the collocation and depending on it for translation. Once all the collocations are analyzed, the construction of the glossary will be based on the methodological approach of Faulstich (2011) and the analysis of a corpus of ten lexicographic products focused on legal terminology selected from the web. The goal of this research is to assist language users in navigating the specificities of legal terminology and to explore the lexicographic approach towards languages for specific purposes. Furthermore, the fact that the corpus consists of a cultural product narrows the gap between academia and the community, whom these results should serve in the first place.
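The typicality threshold used for candidate selection can be made concrete with a short sketch. It assumes that the typicality score is Sketch Engine's logDice measure, computed as 14 + log2(2·f(x,y) / (f(x) + f(y))); the abstract does not name the measure explicitly. The function names and the frequencies in the example are hypothetical and are not taken from the CS corpus.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of node and collocate;
    f_x, f_y: corpus frequencies of node and collocate.
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def filter_candidates(pairs, threshold=3.0):
    """Keep (node, collocate) pairs whose score reaches the threshold.

    pairs: iterable of (node, collocate, f_xy, f_x, f_y) tuples.
    """
    return [
        (node, collocate)
        for node, collocate, f_xy, f_x, f_y in pairs
        if log_dice(f_xy, f_x, f_y) >= threshold
    ]


if __name__ == "__main__":
    # Invented frequencies, for illustration only.
    sample = [
        ("settlement", "offer", 40, 300, 500),
        ("settlement", "the", 1, 4096, 4096),
    ]
    print(filter_candidates(sample))
```

Because the score is logarithmic, each additional point corresponds to a doubling of the Dice coefficient, which is why a fixed cut-off can separate stable combinations from incidental co-occurrences.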
In order to correctly use a word in a foreign language it is not enough to know “its meaning” (i.e., its translational counterpart in the native language). It is also necessary to identify the appropriate contexts of the word’s use, which often differ from those of its counterparts in other languages. Bilingual dictionaries cannot represent all the contextual properties that distinguish entry words from their equivalents; the best of them just provide several examples for each translation option. These examples, however, are not always indicative of which option is to be chosen in every particular context. The task of differentiating between translation options can be systematically approached from a typological perspective. As shown for various semantic domains (e.g., Lander et al., 2012; Rakhilina et al., 2022; Ryzhova et al., 2024), a cross-linguistic comparison of contexts featuring semantically similar words in several languages reveals recurrent patterns of lexical oppositions. These patterns can then serve as a basis for contrastive lexicographic descriptions and ultimately contribute to the construction of bilingual dictionaries of an active type (cf. Apresjan, 2012). In this paper, the principles of such a typological approach are illustrated with a study of adjectives pertaining to density and thickness of physical objects. The study is based on a sample of 25 languages that represent 8 families (Indo-European, Uralic, Northwest Caucasian, Northeast Caucasian, Kartvelian, Semitic, Sino-Tibetan, Japonic). The data were obtained from corpora and interviews with native speakers of the languages under investigation. In accordance with the technique developed within the frame-based approach to lexical typology (for an overview of various approaches, see Koptjevskaja-Tamm et al., 2015), differences between languages are described in terms of frames, i.e., typical situations which are relevant to a given semantic domain.
The notion of frame in lexical typology inherits the key properties of the Fillmorean frame (Fillmore, 1978), but also includes taxonomic and some other restrictions on the slots filled by the arguments (for details, see Rakhilina & Reznikova, 2016). In the case of qualitative terms, frames are largely determined by the type of objects these terms describe, namely, by their taxonomic, mereological and topological properties (Rakhilina & Reznikova, 2022; cf. also Talmy, 1983; Rakhilina, 2000; Kozlov & Privisentseva, 2022 on linguistic effects of the naïve geometrical classification of physical objects). The following frames were identified as underlying lexical oppositions in the THICK/DENSE domain:
1. Dense sets (whose parts are close to each other, cf. ‘dense forest/crowd’)
2. Thick substances (which are difficult to see through or not flowing easily, cf. ‘thick smoke/soup’)
3. Thick layers (i.e., thick flat objects, cf. ‘thick blanket/book’)
4. Thick pivots (i.e., thick elongated objects, cf. ‘thick tree/finger’)
Based on these frames, we can build a typology of systems encountered in our sample. In the simplest systems, the frames 1–4 are distributed between two terms. Such binary systems are attested in three versions:
• Frame 1 is opposed to the rest of the domain, cf. German dichter Wald ‘dense forest’ vs. dicke Suppe ‘thick soup’, dickes Buch ‘thick book’, dicker Baum ‘thick tree’;
• Frame 4 is opposed to the rest of the domain, cf. Kabardian (Besleney dialect) ʁʷəm ‘thick’ (e.g., about a rope) vs. ʔʷəv (about forest, porridge or a layer of snow on a windowsill);
• Frames are evenly distributed between two terms: frame 1 is colexified with frame 2, and both of them are opposed to frames 3–4, cf. Russian gustoj les ‘dense forest’, gustoj dym ‘thick smoke’ vs. tolstyj pled/karandaš ‘thick blanket/pencil’.
Ternary systems in our sample either colexify thick layers and pivots and use specific terms for sets and substances, as is the case in Armenian, or jointly express the thickness of substances and layers and have separate terms for sets and pivots, as in French and Georgian. Finally, our sample also features fully distributive systems that have dedicated terms for each of our frames; such systems are attested, for example, in Chinese and Japanese. Thus, the frame approach can serve as an effective tool for detecting the degree of semantic overlap between translation equivalents and for representing polysemy in both monolingual and bilingual dictionaries in a structured and comparable way.
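The typology of binary, ternary and fully distributive systems can be expressed as a small classification over frame-to-term mappings. The sketch below is only an illustration of the idea: the frame keys, the function name classify_system and the output labels are choices made here, not part of the study's apparatus. The German and Russian terms are those cited in the abstract; the Armenian terms are not given there, so placeholder labels are used.

```python
FRAMES = ("sets", "substances", "layers", "pivots")  # frames 1-4


def classify_system(terms):
    """Classify a language's THICK/DENSE system.

    terms: dict mapping each frame name to the adjective covering it.
    Returns (label, groups), where groups lists colexified frames.
    """
    groups = {}
    for frame in FRAMES:
        groups.setdefault(terms[frame], []).append(frame)
    labels = {1: "single term", 2: "binary", 3: "ternary",
              4: "fully distributive"}
    return labels[len(groups)], list(groups.values())


if __name__ == "__main__":
    german = {"sets": "dicht", "substances": "dick",
              "layers": "dick", "pivots": "dick"}
    russian = {"sets": "gustoj", "substances": "gustoj",
               "layers": "tolstyj", "pivots": "tolstyj"}
    # Placeholder term labels: the Armenian adjectives are not cited.
    armenian = {"sets": "A", "substances": "B",
                "layers": "C", "pivots": "C"}
    for name, system in [("German", german), ("Russian", russian),
                         ("Armenian", armenian)]:
        print(name, classify_system(system))
```

Representing each language as a mapping of this kind makes the degree of overlap between two translation equivalents directly comparable: two adjectives correspond fully only if they cover the same set of frames.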
In this paper we present a newly developed formal framework (as well as its practical implementation) for automatic, lexically driven analysis of Danish text tokens. The framework (called “CLINK”) employs a minimal token definition (the “morph”) and a compact lexical representation (the “CLINK template”). All morphs (i.e., text elements with an individual semantic contribution) are lexicalized using the same template: word forms, affixes, glue elements, punctuation marks, multi-word expressions, etc. Thus, the definition of “lexeme” is reinterpreted in functional-computational terms. The grammar rules of CLINK are purely abstract, viz. those of the Lambek calculus (categorial grammar). This paper gives an overview of the CLINK framework (motivations and application). References to performance metrics will be given (suggesting CLINK to be on a par with the Danish state of the art in PoS-tagging while providing a much richer annotation structure). However, we consider the formal framework in itself to be the main contribution of this short paper. CLINK will be available for test runs at EURALEX.
This study presents a project aiming to make thesaurus data available under an academic licence. The project is based on the printed thesaurus Den Danske Begrebsordbog (DDB) which covers approx. 80% of the Danish dictionary DDO (ordnet.dk/ddo). It presents more than 100,000 different words and expressions categorised and ordered semantically in 22 thematic chapters, and 888 named sections. The data is now downloadable at a webpage where it can be supplemented with different types of lexical information from other resources of choice, e.g., information on valency, etymology, or ontological type. The supplementation is possible due to shared sense id-numbers between the lemmas in the digital thesaurus manuscript, the Danish online dictionary DDO, the semantic lexicon COR.SEM, and a WordNet (DanNet). The webpage allows for new types of studies of the Danish vocabulary with semantic similarity as the starting point. As part of the project, more lemmas from the DDO were added to the digital manuscript which today covers 95% of the dictionary. The vocabulary as well as certain sections and lemmas denoting nationality, sexual orientation, gender identity etc. are thoroughly revised due to the change of attitudes towards this vocabulary in the last decade.
We present a study which was carried out with teacher students of mathematics. They were asked to create either dictionary articles or concept maps for terms from an introductory lecture in their first semester. Based on the students’ submissions, we investigate whether there is a difference in the learning outcomes between the two tasks and also whether the technical means used to solve these tasks influence the students’ engagement in the tasks, i.e., whether they chose digital tools or handwriting to complete them. The analysis presented here is based on a first annotated subset of our data and provides preliminary results on our research questions. We show that digital tools seem to be more appropriate to motivate a deeper exploration of the domain. In addition, our analysis suggests that dictionary articles and concept maps motivate different cognitive approaches to the domain, depending on whether the focus is on the concepts themselves or on the possible relations between them.
The publication of Pope Francis’s encyclical letter, Laudato Si’, in 2015 addressing the climate crisis marked a major intervention by the Catholic Church in environmental debates. The letter encompassed multiple topics including climate science, consumerism, throwaway culture, poverty, and integral ecology. Given the global reach of both climate change and Catholicism, effectively communicating the Pope’s concerns about threats to ‘our common home’ to a wide audience required an extensive translation enterprise. In institutional translation endeavours, a major challenge is the translation of neologisms (Awadh and Shafiull, 2020). While most large multilingual institutions adhere to explicit guidelines and protocols in order to normalize and standardize terminology for translation (Koskinen, 2011), this does not seem to be the case with the Vatican, which translates into more than 30 languages daily, constituting perhaps the largest translation endeavour worldwide. This case study discusses the challenges in identifying and describing neologisms, as defined by Newmark (1985), across five languages (English, Portuguese, French, Spanish, and Italian) in Laudato Si’, a letter containing 93,185 words. Relying on a corpus-based approach with Sketch Engine (Kilgarriff et al., 2014), we identify neologism candidates by searching for quotation marks, terminology lists and hapax legomena. We confirm neologism status by cross-checking against the criteria proposed by Cabré and Sager (1999): recent emergence, dictionarization, instability, and user perception.
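Two of the candidate-detection heuristics mentioned above, hapax legomena and expressions set in quotation marks, can be approximated outside a corpus manager with a few lines of code. The sketch below is an illustration under simplifying assumptions, not the procedure used in the study: tokenization is a plain regular expression, the function name is hypothetical, and the sample sentence is invented for demonstration. Terminology lists and the confirmation criteria are not modelled.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by hyphens or apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")
# Text enclosed in straight or curly quotation marks.
QUOTED = re.compile(r"[“\"‘]([^”\"’]+)[”\"’]")


def neologism_candidates(text):
    """Return (hapaxes, quoted): words occurring exactly once in the
    text, and expressions that appear inside quotation marks."""
    counts = Counter(WORD.findall(text.lower()))
    hapaxes = {word for word, n in counts.items() if n == 1}
    quoted = set(QUOTED.findall(text))
    return hapaxes, quoted


if __name__ == "__main__":
    sample = "The “rapidification” of life and the throwaway culture harm life."
    hapaxes, quoted = neologism_candidates(sample)
    print(sorted(hapaxes))
    print(sorted(quoted))
```

In a text of this size most hapaxes are ordinary rare words, so the output is only a list of candidates to be checked manually. A typographic apostrophe inside a word would also be read as a closing quotation mark by this simple pattern.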
We discuss the hurdles faced in identifying lexical changes in terms from the environmental domain that acquire new senses within the religious institution. One such case is ecological conversion, an ecological term denoting the change of an environment from one state into another, which for the Church means “to undergo a transformation of heart and mind, restoring our relationships with each other, with creation and with God.” Lastly, we address the difficulties that arise in systematizing these neologisms in a database for internal documentation. Our findings provide insights into the strategies employed to transfer neologisms across languages and demonstrate how the Catholic Church employs and adapts existing environmental terminology in the religious domain to convey its message. Our pilot study also reveals the difficulty of detecting neologisms with current corpus-based tools in either a semi- or fully automatic manner, and shows that the criteria for determining neologism status need to be updated in the light of the fast-paced information landscape we now live in.
Terminology has traditionally focused on denotative meaning, reflecting its historical commitment to establishing clear, universally accepted definitions. However, it has generally failed to acknowledge the presence of connotation within specialized discourse. Drawing from ongoing projects, such as EcoLexicon and the Humanitarian Encyclopedia, we explore the fuzzy boundaries of connotation and denotation through the lens of terminological variation and corpus information. Although the notion of connotation is well-known and described in Linguistics, the same cannot be said for Terminology (Humbert-Droz, 2024, p. 14). Generally speaking, denotation refers to the literal explicit meaning of a word, often related to that expressed in dictionary definitions, whereas connotation encompasses the emotional and cultural associations that a word carries beyond its denotative meaning. Denotation facilitates knowledge sharing, but connotation must be acknowledged as the driving force that builds, reshapes and renegotiates meaning, since both terms and concepts are naturally dynamic (León-Araúz et al., 2013) and can no longer be regarded as static neutral constructs. Connotations may thus make their way into denotative meaning, expanding the scope of a concept as new associations and implications arise. Or they can simply be associated with particular terms in particular contexts, thereby giving rise to other non-connoted term variants until they start circulating in discourse. In Terminology, conceptual variation may be understood in different ways (León-Araúz, 2017).
It often refers to meaning extension phenomena (i.e., polysemy, metaphor), but it can also be regarded as one of the causes underlying term variation (i.e., as concepts develop new traits, new lexicalizations arise), or it can be understood as context modulation in the sense of Cruse (1986) (i.e., the concept remains the same but context highlights certain semantic traits while obscuring and suppressing others). Term variation can be the result of different causes other than conceptual variation (i.e., functional, interlinguistic, dialectal, etc.), and variants emerge very often from a combination of different causes. Variation, whether term- or concept-based, can thus be said to cause meanings to split, merge, expand or modulate, blurring very often the distinction between denotation and connotation.
Meaning Split
In the environmental domain, certain concepts have split into two different concepts, either by keeping the same term (i.e., polysemy) or by creating a new one when the original concept started to integrate new traits. For instance, the original concept of conservation gave rise to the similar new concept of preservation. The choice of one or the other has ideological implications: while conservation is more connected to an anthropocentric concern, preservation denotes a greater concern about the well-being of living things other than human beings. However, the current denotation of preservation would presumably have been a connotation of conservation in the past, as it arose from a splitting movement. In terminological resources, terms of this kind require the creation of two distinct concept entries, and usage notes can be used to prevent them from being treated as interchangeable.
Meaning Merging
The inverse phenomenon can occur when the terms employed to refer to two close though originally distinct concepts are used interchangeably. This means that two concepts are merged into one, especially in semi-specialized settings. One example is global warming which, strictly speaking, is only one of the consequences of climate change, but both concepts and terms are very often confused. Bush’s administration started to use climate change instead of global warming in order to soften the message (Lakoff, 2010), which may have caused the increase of climate change denial. In terminological resources, climate change and global warming must give rise to two concept entries, but in the climate change concept entry, global warming can be included as a term variant accompanied by a usage note warning about their usual confusion but also highlighting the contexts in which effect and intention could privilege its use.
Meaning Expansion
Concepts, regardless of the term they are given, can evolve on their own by expanding their semantic traits. Their definitions may thus not be stable across the same domain, since they may also vary according to scientific discoveries and paradigm shifts. Going back to the concept of climate change, it is currently clear that human activity is its primary driver. What seems to be a connotation for deniers, who still believe it to be the result of natural cycles, has become part of its denotation for the scientific community. Definitions in terminological sources should thus be adapted accordingly.
Meaning Modulation
Modulation refers to the fact that concepts may change their relational behavior and semantic traits according to cultural contexts, but they might not have necessarily been scientifically expanded. Cultural contexts may include distinctions caused by disciplines, geographic cultures or organizations. Recontextualization is proven to be the best strategy for representing meaning modulation in terminological resources (León-Araúz et al., 2013). Instead of representing all possible dimensions of a concept, conceptual propositions are activated or constrained according to their salience in a particular context (San Martín, 2022), which can be informed based on corpus data. Recontextualization can be applied to knowledge representation modes such as conceptual networks, definitions and graphical resources. All these meaning shifts may alter the boundaries between denotation and connotation, and different strategies can be employed to represent this dynamism beyond the usage note in terminological resources. The content of these notes can be enhanced by including ideological frames, corpus metadata, or any other data supporting the causes underlying variation and their impact in communication. Moreover, other data categories, such as conceptual relations or definitions, can also accommodate a more flexible approach and reflect the fuzzy boundaries of connotation in a contextualized way (i.e., recontextualized networks, flexible definitions).
For Latvian linguists, the study of slang was not a topical matter until 1970. The literary language and dialects have always been perceived as research priority, and the non-literary language was not considered an object of serious scientific work for a long time. There was a more or less pronounced derogation of the non-literary language. Only a few enthusiasts showed scientific interest in it. Recently, research on Latvian slang has taken major steps with the publication of a dictionary Latviešu valodas slenga vārdnīca (Latvian Slang Dictionary) by Ojārs Bušs and Vineta Ernstone in 2006. This study aims to describe the challenges and solutions that have arisen during the development of the unpublished Latviešu vēsturiskā slenga vārdnīca: dzeršana (Latvian Historical Slang Dictionary: dzeršana (‘drinking’)). The analysed linguistic material is compiled from written sources (from the 17th century onwards), speech notes (from the late 1970s onwards), and student surveys (from the second half of the 1990s onwards). Since Latviešu valodas slenga vārdnīca contains mainly lexis from the last 30–60 years, the paper basically focuses on the period from the origins of Latvian slang until the Second World War.
Introduction
Dictionaries, traditionally perceived as linguistic repositories, have evolved into practical tools incorporating a diverse range of features, with one notable addition being the inclusion of pictures (see Gouws et al., 2013; Liu, 2015; Biesaga, 2016; Lew et al., 2018; Dziemianko, 2022, to name just a few). This study delves into the role of pictorial illustrations in monolingual dictionaries. We take pictures in various general dictionaries as the starting point of our considerations, aiming to arrive at an outline of the treatment of visual materials in The Academic Dictionary of Contemporary Czech (https://slovnikcestiny.cz, henceforth ADCC). Particular attention is paid to various types of illustrations, the relation between verbal and visual information, and the question of which meanings should be illustrated.
Status quo
We find pictures mostly in dictionaries intended for reception, and they are a standard feature in learners’ dictionaries. The function of pictures is to provide visual support for the verbal description of the semantic content of language items, with visual descriptions translating information into a form that is more analogous to reality (Svensén, 2009, p. 298). Pictures have a more immediate effect, and their level of abstraction is lower than that of language. Additionally, pictures enhance the learning effect as they appeal to users’ previous experiences of the world and provoke “aha” reactions. Being more concrete, pictures excel in describing particular things and their appearance, whereas language is generally better at describing actions and states of affairs (Svensén, 2009, p. 298). On the negative side, they are space-consuming, expensive, and mostly demand a great deal of work. Therefore, they should never be included gratuitously. There are several types of visual elements in dictionaries. While photographs are more realistic, drawings can more effectively highlight typical features of an object. Visuals can depict a single object, multiple types within the same class, an object in its normal surroundings, an object in functional operation, actions and processes, objects and terms within a subject field, characteristic aspects of a subject, or environments with typical objects and activities. These typically focus on nouns and, less often, adjectives, verbs, adverbs, etc. (Svensén, 2009, pp. 301–313). Regarding the Czech dictionary landscape, the only general monolingual dictionary incorporating visual elements is nechybujte.cz, which offers some problematic pictures: e.g., pulovr (the only definition is ‘piece of clothing’, but the picture shows ‘exercise’) or kufřík (in Czech, a clear distinction is made between briefcase and attaché case; it cannot be the same object, but cf. aktovka).
The Academic Dictionary of Contemporary Czech
There are no visual elements in the ADCC at present, and there is limited information on this topic in the conceptual materials (Kochová & Opavská, 2016; Šemelík et al., 2023). This might be surprising, but the state of Czech lexicography must be considered: the ADCC is the first completely new larger academic monolingual dictionary in over 40 years. Given the prioritization, pictures are considered a “nice-to-have”, not a “must-have”. During the development process, some earlier decisions have been reconsidered. The printed version has been set aside, and the online version has gained prominence, allowing for the incorporation of pictures in the ADCC. The central question is which meanings should be illustrated. Pictures will be added to the ADCC in two phases: In phase one, images will be provided for concrete nouns by linking the entries to Google Pictures. Phase two will focus on more complex concepts (e.g., “being lonely”), for which images are rarely found in other dictionaries, utilizing new AI capabilities currently in the test phase.
Conclusion
The transition to an online format has opened new opportunities for incorporating pictures in the ADCC. The planned phased approach – initially linking to external sources for concrete nouns, followed by the inclusion of AI-generated images for abstract concepts – demonstrates a strategic method for enhancing the dictionary’s functionality without compromising its scholarly rigor.
Understanding the semantic value of linguistic utterances is crucial for linguistics, lexicography, automatic text interpretation, and various NLP tasks. To address subtle variations within the semantic level, as is well known, machines retrieve stored data from corpora, lexicons and terminologies, and are equipped with taggers and rule-based systems. We already have tools for the development of new lexical resources. Alongside dictionaries, which are excellent repositories of information, corpus managers allow for the retrieval and statistical measurement of the distributive properties of vocabulary and the encoding of its syntagmatic properties. However, by using these tools, it is not always possible to simultaneously search for semantic, formal, and statistical data related to relational and/or categorial meaning, since most of these lack semantic annotation. To address this gap, we initiated a research project, ESMASES+, in September 2023. The main goal of this project is to create an automatic, sustainable, and multilingual semantic annotator. The tagger is conceived to automatically delineate the ontological meaning of nouns in Spanish, French, Galician, and German and resorts to lexical data of previous projects.
This paper introduces “Synonyms in Contrast”, a new online dictionary that addresses the complexities and nuances of neologistic (near-) synonyms in the German language. The emergence of new lexical items, often borrowings from English, has contributed to the proliferation of meaning equivalents. These share a large number of contextual features, causing ambiguity and uncertainty among speakers regarding their appropriate usage. The dictionary, which is part of the new IDS Neo2020+ resource, distinguishes itself by utilising corpus-driven and corpus-based approaches and sophisticated lexicographic methods such as word embeddings to provide users with comprehensive, context-sensitive information on semantic similarities and differences between new synonyms. This resource aims to aid in the clear understanding of new terms, their usage, and interrelations by detailing their semantic overlaps, distinctive contexts, and collocational preferences. Through a combination of empirical data and user-specified presentation, the dictionary facilitates better linguistic decisions in both everyday and more specialised communication. The ultimate goal is to enhance the integration of new lexical variants into the mental lexicon of German speakers by offering a reliable, dynamic, and descriptive platform for exploring and resolving lexical uncertainties with respect to meaning equivalents that have recently emerged.
This paper combines insights drawn from ongoing linguistic and ethnographic work in the Americas and Micronesia with lessons from Lichtenberk’s (2003) work on creating dictionaries for languages in transition to explore the utility of biolexicography and other topical lexicographic approaches. Particular attention is given to key features of biolexicography – including biolexica, the role of communities of practice, the use of ethnographic methodologies, considerations of identity, and linguistically-mediated ecological engagement – and to the relevance of this and other topical lexicographic approaches for languages in transition and their speakers. The ways in which these approaches are shaped by the lexicographic inheritance and by engagements with languages experiencing attrition or change are also considered. This discussion illustrates that the utility of biolexicography is rooted in the linguistic and sociocultural significance of biolexica in ways that integrate lexicographic products into broader cultural and linguistic systems and center speaker communities in the lexicographic process.
Multi-word expressions are a heterogeneous linguistic category which constitutes a significant part of everyday communication and they include linguistic constructions consisting of more than one word, such as idioms (e.g., kick the bucket), binomial expressions (e.g., bread and butter), phrasal verbs (e.g., turn on/off), fixed/conventionalized expressions (e.g., have a nice day) and collocations (e.g., social media) (Wray, 2002; Schlücker, 2019). In this paper, we present a part of the data collected within the scope of the project Procesiranje višerječnih izraza u engleskom kao stranom jeziku (Eng. ‘Processing multiword expressions in English as a foreign language’), the aim of which is to examine and compare the processing strategies in English as L2 and Croatian and Slovene as L1. We analyse the data on native and non-native processing of nominal compounds in the mentioned pairs of languages by investigating the effects of morphological relatedness, frequency and size of morphological family/pattern (Schreuder & Baayen, 1997, Mattiello & Dressler, 2022) and L2 proficiency (Shantz, 2017). In order to compare the factors which affect native and non-native processing of compounds, we will conduct two experiments with masked priming lexical decision tasks (Forster & Davis, 1984) with native speakers of Croatian and Slovene with moderate or high proficiency in English as L2 (for similar research, see Clahsen et al., 2013; De Cat et al., 2014; 2015; González Alonso et al., 2016). In the first experiment, we examine the potential effects of morphological relatedness in nominal compounds, and in the second experiment, we examine the potential effects of schematicity/size of pattern, i.e., whether compounds whose right constituents are merged with a higher number of left constituents are processed faster in L1 and L2. The collection of corpus data that the research relies on has been carried out using the tools available in the Sketch Engine family of language corpora. 
The initial phase of the research involved the identification and extraction of compounds used in the experiment, with around 50 000 word forms which had to be filtered after having been identified in the corpora based on the morphologically-conditioned extraction criteria. The selection criteria for the corpora encompassed factors such as the target language, corpus size, relevance, recency, and the nature of texts included. Finally, approximately 2000 Croatian and Slovenian compound examples were retained after having been extracted and filtered out from the CLASSLA web corpora (Ljubešić et al., 2024a; 2024b). English compounds were sourced from the ukWaC corpus and only 10 000 of the most frequent ones were retained. Based on the corpus data, we determined the frequency of individual compound constituents and the size of pattern/series used as relevant variables in the experiments. The data on L2 proficiency will be obtained using a proficiency test developed for the purpose of this research. The research presented in this paper aims to answer the following research questions: 1. Does morphological decomposition occur with nominal compounds in Croatian and Slovene as L1 and English as L2? 2. Do frequency measures (frequency of individual constituents, schematicity) affect processing in L1 and L2 and to what extent? 3. Is the effect of L2 proficiency modulated by frequency measures, i.e., are speakers of different L2 proficiency affected by frequency measures in a different manner?
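The pattern-size variable mentioned above (how many distinct left constituents combine with a given right constituent) can be illustrated with a minimal Python sketch. The function name `right_family_sizes`, the pre-segmented input format, and the toy English compounds are assumptions made for illustration only; they are not the project's data or its extraction pipeline.

```python
from collections import defaultdict


def right_family_sizes(compounds):
    """Count the distinct left constituents attested with each right constituent.

    `compounds` is an iterable of (left, right) pairs that have already been
    segmented; segmentation itself is outside the scope of this sketch.
    """
    lefts_by_right = defaultdict(set)
    for left, right in compounds:
        lefts_by_right[right].add(left)  # a set ignores repeated tokens
    return {right: len(lefts) for right, lefts in lefts_by_right.items()}


# Hypothetical toy data; the repeated pair is counted once.
compounds = [
    ("tooth", "brush"),
    ("hair", "brush"),
    ("paint", "brush"),
    ("tooth", "paste"),
    ("tooth", "brush"),
]
sizes = right_family_sizes(compounds)
print(sizes)
```

In this toy example the right constituent "brush" has a larger pattern than "paste", which is the kind of contrast the second experiment is described as manipulating.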
Deictic expressions are some of the most obvious linguistic elements that require contextual information for their semantic interpretation, thus linking the so-called denotational situation (i.e., what is said in an utterance) and the speech situation (i.e., when, where and by whom the elements are used): they form a link between truth-conditional semantics and context-dependent pragmatics. Considering deixis as a pragma-semantic category, the present study takes up the issue of how semantic and pragmatic characteristics encoded in deictic signs are presented in lexicographic practice, particularly in monolingual English and Armenian dictionaries, and tries to introduce ways of tackling problematic entries. Deictic words often pose problems for the dictionary representation of word meaning because of their specific semantics. The very specific character of deictic semantics makes it reasonable and necessary to rely on its uniqueness when lexicologically processed: as all deictic expressions share typical properties, they can be similarly explicated in the dictionary definitions. Proceeding from the assumption that lexicographers should take into account the type of meaning the word conveys, cognitively oriented research has been undertaken as to how adequately and systematically dictionaries integrate semantic and pragmatic information into the definitions of such a specific type of language unit as deictic words. Cognitive linguistics is claimed to offer a useful framework for lexicographers and provide certain kinds of structured background information (Fillmore, 2003), which can help to create systematic and well-grounded dictionaries with comprehensive information about specific words and their meanings. The focus of the research is on the following questions: (1) How should deictic semantics be reflected in dictionaries? (2) Which form of lexicographic reflection is preferable and adequate?
(3) Which concrete phenomena should be addressed specifically? From the methodological point of view, we rely on modern trends in lexicography based on the concept of compiling dictionaries with a systemic descriptive approach (Apresjan, 2001). The method draws on the detailed semantic grouping of deictic expressions into different semantic types according to the degree of the word’s deicticity. In our research, we claim that the meaning of deictic terms is multilayered: it includes a semantic layer proper with its designative component, or ‘value’, a specific pragmatic layer that points to the speech-situational factor this value is relative to and presupposes its reference point, and a more general layer – that of part-of-speech membership, which is also indicative of some degree of deicticity (Yerznkyan, 2013). We also discriminate subclasses of deictic expressions that are distinguished semantically, structurally, and pragmatically and suggest that they be introduced in dictionaries accordingly, based on a system of parameters which characterize each lexical unit as to the degree of its deicticity. The empirical evidence from earlier and contemporary English and Armenian dictionaries suggests that they show few signs of having directly addressed this issue. A notable difference observed in the overall number of descriptors/markers of deicticity indicates that this issue has not been adequately covered. The explicit presentation of semantic and pragmatic information in the definitions of deictic words in the dictionaries under study lacks uniformity. As the lack of information about the composition of deictic meaning might be misleading and confusing, our research calls for a reconceptualization of the lexicographic principles when dealing with this specific class of signs with intrinsic pragmatic loading.
The achievement of this goal presupposes the elaboration of universal criteria for presenting the deictic signs as a system, for one of the main features of this class of words is its systemic character. In our research, we have singled out some components of structural, semantic and pragmatic information typical of deictic signs, which should be systematized and standardized in order to be adequately presented in a dictionary and facilitate the speakers’ ability to improve their communicative competence. We propose to tackle this issue through four main parameters and suggest the following classification of deictic signs into several groups: pure or impure (semi-deictic) deixis – depending on the type of nomination (pointing and/or naming) employed in each case, subjective or objective deixis, depending on the type of deictic orientation, implicit or explicit deixis – depending on the mode of expression considered from a morphological perspective, and abstract or concrete deixis – depending on the type of the reference point (Yerznkyan, 2013). As solving the identification problem in deictic reference involves primarily the setting of a basic reference point, the deictic centre, or the ‘Origo’ in K. Buhler’s terminology (Buhler, 1990), dictionaries can deal with it as the most fixed and indispensable part of deictic semantics: the way the intended referent is to be decoded should be signalled in the dictionary definition of deictic words. Thus, what we need to do is to organize that information according to the deictic categories with special reference to the type of deictic sign and the degree of their deicticity taking into particular account the interaction between lexical semantics and pragmatics. We believe that this type of information is linguistically (and lexicographically) relevant, as it totally governs the use of deictic expressions in speech. 
Admitting that the reference point (the Origo) we spoke about is the lexicalized pragmatic component, which is directly embedded in the meaning of deictic words and has a constant systemic status in language, it should be systematically integrated and accordingly defined in the dictionary definitions as a marker of deicticity. Thus, we believe that any attempt at a detailed semantic analysis of the actual linguistic data in its complexity may help to adequately present it lexicographically. Deictics are one of those areas where semantics can largely contribute to lexicography.
This article presents the compilation of entries for several foreign languages, namely English, Italian, Latin and German of the Contemporary Slovene Dictionary of Abbreviations (CSDA). The material for the compilation of CSDA has been collected in a time frame of twenty years, both manually from monolingual, bilingual, general and terminological dictionaries (always paired with the Slovene language) and automatically, by using the algorithm for automatic recognition of abbreviations and expansions in electronic texts (Kompara Lukančič, 2010; 2011; Fabijanić, 2014; 2015a; 2015b), developed by the author for the Slovene language in 2011. The algorithm was prepared in line with the outcomes of the research in the recognition of abbreviations from Taghva & Gilbreth (1999) for the English language and it was adjusted to meet the structural features of Slovene abbreviations and their pronounced characteristic in texts. Apart from Slovene abbreviations in the Contemporary Slovene Dictionary of Abbreviations (CSDA), abbreviations in over 10 foreign languages had been included, the largest number in English, followed by Latin, French, Italian, German, Spanish, Croatian, etc. In 2023, the Slovene dictionary of abbreviations – Slovenski slovar krajšav (Kompara Lukančič, 2023) was released, after the compilation method and the microstructural features of entries had been explained (Kompara Lukančič, 2015; 2017). In 2023, the structure and characteristics of the English entries were presented following an analysis of the layout and characteristics of entries in English dictionaries of abbreviations and Slovene terminological dictionaries (Kompara Lukančič, 2023; Malenica & Fabijanić, 2013; Fabijanić & Malenica, 2013). 
The compilation process of Slovene dictionary entries (Kompara, 2015), where only expansions, field qualifiers and possibly some additional information are given, differs from the compilation of English entries (Kompara Lukančič, 2023) or entries in other languages, such as Italian, Latin, German, which are composed of language qualifiers, expansions, translations into Slovene and in some cases the Slovene abbreviations. In compiling the English abbreviations dictionary entries (Kompara Lukančič, 2023), the structure of the English abbreviations dictionary entries from the Slovene terminological dictionaries was used, which is composed of: a headword, followed by the abbreviated language qualifier, expansions in English, Slovene translation, or a translational equivalent and Slovene abbreviations. Based on the structure explained in the compilation process for Slovene and English entries (Kompara Lukančič, 2023), in the present article we present the compilation of entries in English, Italian, German and Latin. Based on the presented examples, microstructural elements were determined, and applied to entries of English, Italian, Latin, and German abbreviations of the Contemporary Slovene Dictionary of Abbreviations (CSDA). As already explained in the compilation process for the English entries (2023), the used structure is a good example of the compilation steps and could be applied also to other foreign languages. As visible in Table 1, we present the characteristics of the entries, in relation to the usage of language qualifiers (e.g., angl. – ‘English’), field qualifiers (voj. – ‘military’), additional data (e.g., (oskar) najbolj prestižna filmska nagrada v ZDA – ‘(Oscar) the most prestigious film award in the USA)’, the equivalent Slovene abbreviations (e.g., KZ) etc. 
In the dictionary, we also aimed to include cross-references, and we drew attention to the characteristics and typology of Latin, Italian and German abbreviations in relation to Slovene and English abbreviations. Such differences were important in the collection and preparation of the material for the dictionary compilation as well as for the automatic recognition of abbreviations in texts. Here we would like to highlight the typology of the Italian abbreviation aff., for which the expansion was not encountered in the text during the recognition process, as was also the case for Slovene abbreviations, e.g., itd., itn., npr., and the Latin abbreviation A, which has a variety of additional usages, e.g., a., an., as well as Slovene equivalents (see Table 2). In the paper, we present examples of good practice that will help in the future preparation of dictionary entries for other languages, namely German, Italian, Latin etc., as part of the Contemporary Slovene Dictionary of Abbreviations.
This study examines the role of the BLD and the user’s receptive vocabulary knowledge in making appropriate lexical decisions in a sentence completion task. Fifty-two advanced and low-proficiency learners were recruited for the study. They were asked to read a set of gapped sentences and fill in the blanks with semantically similar verbs, using a randomly assigned dictionary (either BLD or MD). A logistic mixed-effects regression model was developed in which a participant’s choice of a verb was treated as a success or failure in the task. The model showed that the successful completion of the task significantly depended on the learner’s vocabulary knowledge rather than on the type of dictionary. Likewise, another regression model (two-way ANOVA) revealed that the time taken to complete the task was significantly affected by learners’ vocabulary knowledge but not by dictionary type.
This paper introduces the Erasmus Mundus Joint Master in Lexicography – EMJM-EMLex – and especially some of its new developments and objectives. The EMJM-EMLex programme remains focused on lexicography, but is evolving into a multidisciplinary, digital discipline in order to adapt to societal and scientific changes. The programme was founded in 2009, and experience to date confirms that the success of EMLex lies in its unique objectives and in responding to the ever-changing needs of society while fostering international collaboration, linguistic diversity and cultural exchange. The programme’s multidisciplinary approach is in line with the Council of Europe’s language policy goals, emphasising lifelong learning and multilingualism. With 8 full member universities and 8 associate members in its consortium, the programme offers an international and interdisciplinary curriculum covering lexicography, digital humanities, language technologies and natural language processing. EMJM-EMLex integrates theoretical and practical aspects of lexicography over four semesters, including mobility periods, advanced modules, internships and a master’s thesis, after which graduates receive a joint degree from the member universities. EMJM-EMLex graduates are prepared for diverse career opportunities. The programme’s commitment to innovation, internationalisation and sustainability ensures its continued relevance and impact in the digital age.
This paper shows the research potential of the virtual lexicographic laboratory VLL DLE 23, based on the text of the Spanish Explanatory Dictionary (DLE 23). Virtual Lexicographic Laboratories (VLL) are effective tools for dictionary-based linguistic research. The lexicographic text is considered not only as a basis for creating and updating dictionaries but also as a means of professional communication and transfer of linguistic knowledge. Primarily, this applies to explanatory dictionaries, which are characterized by a detailed and multi-aspect description of language units. This raises the problem of providing such dictionaries with appropriate tools that make it possible to extract any linguistic information from the text of the dictionary during linguistic research. This paper describes the experience of creating such tools during the implementation of the VLL DLE 23 project, a Virtual Lexicographic Laboratory based on the Spanish dictionary “Diccionario de la lengua española. 23ª edición” (https://dle.rae.es/), published by the Royal Spanish Academy. The current version of the VLL DLE 23 can be accessed at https://svc2.ulif.org.ua/Dics/ResIntSpanish. The VLL DLE 23 project was implemented in three stages. At the first stage, the text along with the HTML markup was (partially) extracted from the available online version of the dictionary. At the second stage, the dictionary text was analyzed in order to identify the informational elements of the entries. At the third stage, a model of the L-system was built, which formally displays DLE 23 information elements and serves as the basis for creating a database and interface.
The current interface makes it possible to generate statistics for the entire dictionary or for a certain selection of dictionary entries, to conduct linguistic research on lexical meanings, etymology, grammar, and the peculiarities of the usage of Spanish language units, as well as to create derivative dictionaries based on DLE 23, for example: a dictionary of morphemes, a dictionary of homonyms, a dictionary of word combinations, etc. The VLL DLE 23 interface provides the following modes of work with the dictionary: a) dictionary register; b) dictionary entry profile; c) full-text search. The dictionary register allows the user to select a headword either by clicking on it in the list or by typing a sequence of characters that exactly matches the word or word combination they are looking for. Work with the dictionary register is supported by filters, such as “starts with”, “ends with”, “exactly”, “contains”. This mode resembles working with the online version of the dictionary. Dictionary entry profile is a mode of VLL operation in which it is possible to create samples of dictionary entries, for which the user activates the elements of dictionary entries and selects the meanings of these elements. The user can select a specific type and structure of headwords by checking the appropriate box. There is also an additional option to include homonymous words in the sample. The current version makes it possible to study: 1) types and number of words in the dictionary list: morphemes (prefixes and affixes), words and word combinations; 2) word-forming characteristics, i.e., masculine and feminine forms, headword doublets; 3) phenomena of unambiguity, ambiguity, and homonymy of headwords; 4) Spanish language vocabulary by origin (specific and borrowed vocabulary); 5) presence or absence of word combinations formed with headwords. Figure 1 shows the options that must be selected to create a sample of dictionary entries containing register words of foreign origin. Full-text search.
This mode is used to select dictionary entries by certain elements of the DLE 23 metalanguage or by a certain text fragment. It is also possible to make a selection of dictionary entries containing a particular fragment of the explanation. Full-text search can be used together with the “Dictionary entry profile” mode. The process of expanding the research potential of the VLL DLE 23 is ongoing. In the near future, it is planned to index the remaining elements of the dictionary entries discovered at the previous stage of the project implementation. For working with the dictionary text in a digital environment, all of its informational elements that may be of interest to linguists during their research need to be accessible.
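The four register filters named above (“starts with”, “ends with”, “exactly”, “contains”) can be illustrated with a minimal Python sketch. The function `filter_register`, its parameters, and the short headword list are hypothetical; nothing here is taken from the VLL DLE 23 implementation, and the comparison is deliberately case-sensitive to keep the example short.

```python
def filter_register(headwords, query, mode="starts with"):
    """Return the headwords that match `query` under the chosen filter mode."""
    matchers = {
        "starts with": lambda word: word.startswith(query),
        "ends with": lambda word: word.endswith(query),
        "exactly": lambda word: word == query,
        "contains": lambda word: query in word,
    }
    if mode not in matchers:
        raise ValueError(f"unknown filter mode: {mode!r}")
    match = matchers[mode]
    return [word for word in headwords if match(word)]


# Hypothetical mini-register used only for this illustration.
register = ["casa", "casar", "ocaso", "mesa"]
print(filter_register(register, "cas", "starts with"))
print(filter_register(register, "sa", "ends with"))
print(filter_register(register, "casa", "exactly"))
print(filter_register(register, "cas", "contains"))
```

Keeping the modes in a dictionary of predicates means a further filter can be added without touching the selection logic.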
Soup-to-Nuts is a program to automatically induce a lexicon of MultiWord Expressions (MWEs) from a corpus and re-tokenize the corpus based on the MWEs that were found. I will discuss how the program works, and give a demo to show how it performs.
Choueka (1988) was the first to induce a phrasal lexicon from a corpus, and that work is highly related. Bigrams and longer n-grams (up to six-grams) were extracted from a 10-million-word corpus of the New York Times. The n-grams were processed using an algorithm similar to my own: rejecting candidates with a closed-class word, identifying and rejecting chunks, and rejecting candidates that contain numbers, dates, and times. Schone & Jurafsky (2001) also induced a multiword lexicon from a corpus. They used a 6.7-million-word subset of the TREC databases (a set of corpora that are used for evaluating information retrieval systems). These datasets are too small. To put things in perspective, McKeown et al. (2017) mention that starting in fourth grade, the average frequency of the words to be acquired is one per million tokens, or less. MWEs are usually less frequent than individual words, making the problem even worse for vocabulary acquisition of MWEs. Frequency is a problem both for computational acquisition of MWEs and for acquisition by children. There are many factors that are important for inducing a phrasal lexicon. I compared three lexical association metrics: Pointwise Mutual Information (Church & Hanks, 1990), Log Likelihood (Dunning, 1994), and Mutual Rank Ratio (Deane, 2005). I found that Mutual Rank Ratio (MRR) was the best for inducing a lexicon. The expressions that were highly ranked using Pointwise Mutual Information were very infrequent, such as pprox harengus, balaena mysticetus, secale pprox, and erodium cicutarium. They were expressions that can occur in a dictionary, but they were mostly unfamiliar. In contrast, the expressions that were highly ranked by Log Likelihood included of the, for the, and such as. The most highly-ranked expressions using MRR included prime minister, science fiction, San Francisco, and human rights. Filtering candidates that start or end with a closed-class word, and filtering inflectional variants, made a major improvement in recognizing good candidates. However, even with this filtering, none of the association metrics were effective when they were evaluated on a large corpus, and the frequency threshold was set so as to capture at least 50% of the attested MWEs. This was true for two different dictionaries that were used as a gold standard.
I found it was essential to use multiple corpora. I evaluated four datasets: 1) a download of Wikipedia; 2) a download of 30,000 books from Project Gutenberg; 3) Medline, a corpus of titles and abstracts from the biomedical literature; and 4) Juris, which is a corpus of legal text. Candidates with an MRR score greater than 2.0 were identified that appear in two or more corpora, and the candidates were re-ranked by dispersion (the number of corpora that the candidate occurs in). Those candidates that occur in all four corpora were ranked first, then the candidates that occurred in three, and then those in two. This more than doubled the average precision.
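The dispersion-based re-ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are invented, the function name `rerank_by_dispersion` is hypothetical, a candidate is counted for a corpus only when its score in that corpus exceeds the threshold, and ties are broken alphabetically. None of this is taken from the Soup-to-Nuts code.

```python
def rerank_by_dispersion(scores_by_corpus, threshold=2.0, min_corpora=2):
    """Rank candidates by the number of corpora in which they pass the threshold.

    `scores_by_corpus` maps a corpus name to {candidate: association score}.
    Returns (candidate, dispersion) pairs, highest dispersion first.
    """
    dispersion = {}
    for scores in scores_by_corpus.values():
        for candidate, score in scores.items():
            if score > threshold:
                dispersion[candidate] = dispersion.get(candidate, 0) + 1
    kept = [(cand, n) for cand, n in dispersion.items() if n >= min_corpora]
    # Sort by dispersion (descending); alphabetical tie-break is an assumption.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Invented scores for illustration; they are not results from the study.
scores_by_corpus = {
    "wikipedia": {"prime minister": 5.1, "human rights": 4.2,
                  "civil rights": 3.0, "science fiction": 3.9, "of the": 0.3},
    "gutenberg": {"prime minister": 3.3, "human rights": 2.5, "of the": 0.4},
    "medline": {"human rights": 2.2, "science fiction": 1.1},
    "juris": {"human rights": 6.0, "prime minister": 2.4, "civil rights": 2.8},
}
ranking = rerank_by_dispersion(scores_by_corpus)
print(ranking)
```

In this toy input, a candidate that passes the threshold in only one corpus is dropped, which mirrors the requirement that candidates appear in two or more corpora.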
In addition to the association metric, we also need to consider morphological variation. This is important in selecting a normal form for the headword, and in identifying and rejecting chunks. A chunk is a part of a multiword expression that does not have an identity of its own, for example, Osama bin and Leonardo da. In contrast, bin Laden and da Vinci are not chunks because they have an independent existence. We can look for bin Laden, or we can say I saw a da Vinci in a museum. We need to recognize that automated teller is a chunk because it is a component of automated teller machine, and Yellowstone National is a chunk because it is a component of Yellowstone National Park. But we also see automated teller machines, and Yellowstone National Parks because of the context Grand Teton and Yellowstone National Parks. Sometimes plural forms occur more frequently (e.g., civil rights, human rights). I use the most frequent form, singular or plural, as the norm for an individual corpus, and the form that has the greatest plurality in creating a common inventory. In tokenization, all morphological and orthographic variants are treated as an equivalence class. There are at least three types of chunks: 1) Expressions like Osama bin and Leonardo da that are a part of a single longer expression; 2) Expressions like density lipoprotein that are a part of high density lipoprotein and low density lipoprotein. Both of these expressions are supported by acronyms, and they are important in biomedical natural language processing; 3) Expressions such as Institute of Technology, and Bureau of Investigation. They represent a generative pattern in which there is a typed variable that is a part of the longer expression (e.g., Indian, Massachusetts, National, Federal). Dictionaries often include typed variables such as someone or something in their definitions, and these examples illustrate other generative patterns that can occur in MWEs.
The most important issue to address is compositionality. There is previous work on using machine learning to help with this (Roberts & Egg, 2018; Cordiero et al., 2019), but I believe a larger and more focused effort is needed. I am in the process of preparing a dataset that is divided into strong idioms (hot dog), weak idioms (room temperature), and compositional expressions. The compositional expressions are annotated with the type of relationship using Lexical Functions (Mel’čuk, 2023), and Qualia relationships (Pustejovsky, 1998). The aim of these annotations is to increase confidence that the MWE candidate does or does not belong in the lexicon.
The primary purpose of re-tokenization is to create associations between MWEs and MWEs (e.g., Abraham_Lincoln and Emancipation_Proclamation), and between MWEs and individual words (e.g., Abraham_Lincoln and slavery). The goal of creating these associations is to develop better methods for teaching and assessing vocabulary. Most work on vocabulary assessment focuses on breadth. Previous work has used lexical associations for assessing depth of knowledge as well (McKeown et al., 2017). I am extending that work to include associations involving MWEs. The demo will show some of the associations that were identified using the program.
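The idea of re-tokenization followed by association counting can be sketched as follows. This is an illustrative toy under stated assumptions, not the program's actual implementation: it assumes a known MWE inventory, merges MWEs by greedy longest match, and counts sentence-level co-occurrence as a crude stand-in for an association metric. The names `retokenize` and `cooccurrences` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def retokenize(tokens, mwes, max_len=4):
    """Greedy longest-match merge of known MWEs into single tokens."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest span first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            span = tuple(tokens[i:i + n])
            if span in mwes:
                out.append("_".join(span))
                i += n
                break
        else:
            # No MWE starts here: keep the single word.
            out.append(tokens[i])
            i += 1
    return out


def cooccurrences(sentences):
    """Count how often two distinct tokens share a sentence."""
    counts = Counter()
    for sent in sentences:
        for a, b in combinations(sorted(set(sent)), 2):
            counts[(a, b)] += 1
    return counts


mwes = {("Abraham", "Lincoln"), ("Emancipation", "Proclamation")}
sentence = "Abraham Lincoln issued the Emancipation Proclamation".split()
merged = retokenize(sentence, mwes)
pairs = cooccurrences([merged])
```

After merging, the pair (Abraham_Lincoln, Emancipation_Proclamation) can be counted like any pair of single words, which is what makes MWE-to-MWE and MWE-to-word associations possible.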
Soup-to-Nuts is open-source and will be distributed via GitHub.
In this paper, we present the search and visualization interface of the Croatian derivational lexicon ‒ CroDeriv. CroDeriv contains information on the derivational and morphological properties of Croatian lexemes. Each lemma in the lexicon is enriched with its word-formation analysis and morphological segmentation. The search interface enables simple and advanced queries, i.e., by lexemes, by morphological structures, and by word-formation patterns. Moreover, the visualization interface enables the graphical representation of derivational families.
The incorporation of images in dictionaries has been addressed in several papers (cf. Biesaga, 2016; Klosa, 2015). Estonian lexicography has a long tradition of including visual materials in learners’ and terminological dictionaries. However, until recently, there was no picture dictionary for learners of Estonian as an L2 that is accessible as a separate resource and simultaneously linked to the Estonian L2 learners’ dictionary. To address this gap, the multilingual Estonian L2 picture dictionary was compiled. Below, we outline the basic methodological implications.
Languages
Estonian, Russian, and Ukrainian
Target Group
The dictionary targets both young and adult learners at the elementary (pre-A1–A2) and intermediate levels (B1). For preschool children, there is also an option to toggle between capital and small letters (see Figure 1).
Vocabulary and Topics
Currently, the dictionary contains approx. 1000 images (drawings) grouped into 52 themes: vehicles, fruits and berries, flowers, birds, insects, sports, clothing, etc. The dictionary includes nouns, adjectives, verbs and prepositions. We opted for drawings over photos, as they offer greater flexibility, allowing us to better capture the cultural and national specifics of Estonia.
As a starting point for determining which words to include, we used the Estonian Vocabulary Profile (Kallas et al., 2021), which provides separate wordlists for all CEFR levels for young and adult learners. This allowed us to divide the vocabulary of the same topic into two proficiency levels. For example, the topic ‘weather’ includes 12 images but, at the elementary level, we show just four (sun, wind, snow and rain). Additionally, for young learners at the pre-A1 level, we considered curriculum requirements to ensure the dictionary reflects both everyday and educational contexts.
The interface allows users to search for a word or part of a word, and browse topics in three layouts: i) separate images (Figure 1); ii) scenic images (Figure 2); iii) image series (Figure 3). Scenic images are intended for learners to study individual words first and then describe a scene using those words. Image series are particularly useful for illustrating spatial and sequential relations.
Usage Examples
Each image is accompanied by usage examples specifically chosen to illustrate the content, visible when the user flips the card. Sentences are compiled to assist the user in describing the picture.
Linking with Other Sources
Each image in the dictionary is linked to the corresponding word in the dictionary portal Learners’ Sõnaveeb (Koppel et al., 2019), which is the user interface for the Dictionary Writing System Ekilex (Tavast et al., 2018). The user must click on the Sõnaveeb icon located at the upper right corner of the image. This linking, based on lexeme IDs in Ekilex, allows the user to explore word meanings, synonyms and collocations, and learn how to decline or conjugate words.
Audio Files
The Estonian version of the dictionary is audio-enabled, allowing the user to hear the pronunciation of all words and sentences. Professional actresses were employed for the pronunciation of words, while text-to-speech synthesis developed by the Institute of the Estonian Language was used for sentences.
Data Format and License
The data are stored in JSON format and can be used under the Creative Commons BY 4.0 license, with audio files in WAV format and images in SVG or JPG format.
In the near future, we plan to add English and French, and to incorporate audio files for languages other than Estonian. We also envision creating games, such as matching images with corresponding words or sounds. Additionally, we aim to explore generative AI systems, such as image and audio generators, to enhance the creation of picture dictionaries and supplement learner dictionaries with multimedia content.
Since February 2022, a war has been raging in the middle of Europe, in Ukraine, with serious consequences for all Europeans and, from a global perspective, shaking the existing world order. In reporting on this war, numerous phraseological units are emerging, including many proverbs that refer to the concept of WAR and are highly interesting for linguists because they demonstrate how expressively and emotionally language users react to societal and political challenges and how they attempt to represent war narratives in language.
This study delves into the intricate world of phraseology used in German and Ukrainian press texts to refer to the concept of WAR during the Russia-Ukraine war, seeking to unravel the rich tapestry of expressions that encapsulate the multifaceted nature of armed conflicts. Through an examination of linguistic nuances, historical contexts, and cultural influences, this corpus-based research aims to shed light on the unique lexical landscapes that have evolved around the theme of war in these two distinct linguistic traditions.
Methodologically, the empirically supported study combines analyses on various levels: cognitive-linguistic analyses of imagery, word formation analyses (constituent analyses), analyses of contextual embedding, and contrastive analyses considering translations. Where possible, the data is also quantitatively evaluated. The Ukrainian corpus is sourced from Slovnyk Ukrainskoi Movy, Myslovo, the online newspapers Ukrainska Pravda, Unian, and the online media Deutsche Welle (Ukrainian). The German corpus is derived from Digitales Wörterbuch der deutschen Sprache, das Kleine Lexikon: Krieg und Sprache, the online newspapers Zeit Online, Welt.de, and the online media Deutsche Welle (German). In total, the corpus contains 95 Ukrainian and 58 German phraseologisms. The analysis is based on a comparative corpus consisting of 525 German and 525 Ukrainian press reports.
Phraseologisms are defined in this research paper as fixed word complexes of various syntactic structure types with singular linkage of components, whose meaning arises as a result of complete or partial semantic reinterpretation or transformation of the component elements. They are mainly characterized by polylexicality, reproducibility, and (optionally) idiomatization (Burger, 2015, pp. 11–32).
The research begins with a comprehensive exploration of Ukrainian phraseology used in mass media from 24th February 2022 to 31st March 2024, drawing from historical events, folklore, and contemporary narratives. The Ukrainian language, deeply rooted in a history marked by conflicts and struggles for independence, boasts a lexicon that reflects the resilience and courage of its people. From proverbs (Хочеш миру – готуйся до війни; Буде тобі враже, як Залужний скаже) to phraseological units (роздмухувати вогонь, стояти на смерть, зайти у глухий кут, перетнути червону лінію), the study uncovers linguistic expressions that provide insights into the Ukrainian psyche, shaping a narrative that goes beyond mere words to embody the collective experiences of a nation.
In parallel, the research extends its focus to the German language, a linguistic realm that has been shaped by the turbulence of Europe’s history. Traversing through phraseologisms (jmd freie Hand lassen, etwas im Schilde führen, im Visier stehen) deeply embedded in German culture, the study explores how phrases related to war have evolved over time, intertwining with the nation’s complex relationship with its militaristic past. Drawing from mass media and dictionaries, the analysis aims to illuminate the nuanced ways in which the German lexicon reflects a society’s grappling with the repercussions of war and the pursuit of peace. A comparative analysis follows, juxtaposing Ukrainian and German phraseologies to identify similarities, differences, and cross-cultural influences. The contrastive analysis of the corpora shows that the Ukrainian corpus is richer, more vivid, and more diverse than the German one. Unlike the German corpus, it contains many phraseological neologisms (приміряти чорні пакети, придбати квиток на концерт до Кобзона).
It is remarkable that in both Ukrainian and German the color terms black, green, and red appear in phraseological expressions (дати зелене світло/grünes Licht geben, перетнути червону лінію/eine rote Linie überschritten, чорний, як сама смерть/schwarz wie der Tod). The comparison of phraseological pairs reveals two main types of equivalence, which are shown in Table 1.
The conclusion synthesizes the findings, emphasizing the dynamic nature of linguistic expressions related to war in Ukrainian and German. The lexicons of these languages are not static; rather, they evolve in response to societal shifts, historical reflections, and geopolitical developments. By unraveling the phraseologies surrounding war in these languages, this research not only provides valuable insights into linguistic structures but also contributes to a deeper understanding of the complex interplay between language, culture, and conflict.
This software demonstration presents a data model and a first use case for the representation of text corpus data on a Wikibase instance, including morphosyntactic, semantic and philological annotations as well as links to dictionary entries. Wikibase, an extension of MediaWiki, is the software that underlies Wikidata, an exceptionally large crowdsourced queryable knowledge graph, which includes nodes for ontological concepts, on the one hand, and for lexemes, lexeme senses and lexeme forms, on the other, together with annotations to and relations between them. We argue that the proposed model and the chosen software solutions for the representation of corpus and dictionary data, all free and open source, meet the requirements of provenance transparency, open access and re-use, and the capability of collaborative work on the data. We also present our own scripts, wrapped in a web application, that shortcut several workflow steps in a first use case: a 1737 Basque manuscript, transcribed on Wikisource and represented as an annotated dataset on our Wikibase instance.
We present a demo of MWE-Finder, an application that enables a user to search for (flexible) multiword expressions (MWEs) in Dutch text corpora (Odijk et al., 2024). We will show many different examples in the demo, but here we show one example.
A multiword expression (MWE) is a word combination with linguistic properties that cannot be predicted from the properties of the individual words or the way they have been combined by the rules of grammar (Odijk, 2013). Many MWEs in Dutch are flexible in the sense that the component words can occur in multiple forms, are not necessarily adjacent, and do not always occur in the same order. This makes it difficult to search for such MWEs with most existing query systems, but MWE-Finder has been specifically made to deal with this.
The targeted users of MWE-Finder are linguists and lexicographers who want to investigate properties of Dutch MWEs. A user can enter an MWE in canonical form. A canonical form is a unique form that represents a set of linguistic objects, in this case a set of variants of the MWE (different component forms, different word orders, etc.). The notion of canonical form for MWEs has been defined in Odijk & Kroon (2024). Users can enter their own MWE in canonical form or select one from more than 11k canonical forms that MWE-Finder offers. These canonical forms are mostly based on the native speaker intuitions of the creator of the resource. This canonical form can be seen as a hypothesis about the properties of this MWE. In particular, by using this canonical form it is stated that the word dans ‘dance’ cannot be modified and that it must be accompanied by the determiner de ‘the’.
We call this MWE the target MWE and when it is entered, MWE-Finder automatically generates three queries:
MWE Query (MEQ): searches for the MWE;
Near Miss Query (NMQ): searches for the content words of the MWE with the grammatical configuration they are in in the MWE;
Major Lemma Query (MLQ): searches for the content words of the MWE ignoring the grammatical configuration.
These queries are increasingly less strict. The user can now select the treebank to search in. MWE-Finder offers many treebanks, and users can also upload their own corpora, which are turned into treebanks and made available for search. We select the Mediargus treebank.
The results of the queries are presented to the user on the screen as they come in. For the MWE as in (1), the MEQ finds 1158 hits in over 103 million sentences. The NMQ finds 1271 hits. If we exclude the results of the MEQ (an option that MWE-Finder offers), we quickly see in the 131 remaining hits that the target MWE occurs in variants not predicted by the canonical form that we started with, because the word dans can occur with a variety of modifiers and determiners.
This suggests that the canonical form that we started with was too strict. We must allow for modification of the MWE component dans, and we must recognize that the article de is not a component of the MWE. A better canonical form would be iemand zal dd:[de] *dans ontspringen, in which the code dd:[..] surrounding de indicates that de can be replaced by other definite determiners, and the * before dans means that it can be modified. In this way, we can improve upon an initial canonical form mainly based on native speaker intuitions by systematically taking into account corpus data. MWE-Finder makes this possible in a very efficient and user-friendly way.
The MLQ finds 1309 hits. If we exclude the results of the NMQ, we have to inspect 38 examples. These are mostly valid instances of the MWE de dans ontspringen that have been wrongly parsed, but we also find a variant of the MWE, viz. (2), for which we can now add a canonical form to our lexicon of MWEs in canonical form.
In this way, a linguist or lexicographer can easily and efficiently investigate the properties of Dutch MWEs, and improve the description of Dutch MWEs.
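The relationship between the three queries described above can be pictured with a toy model. The sketch below is not how MWE-Finder is implemented; it assumes a hypothetical flat dependency representation (lemma, head index, relation label) with invented relation names, and hard-codes the MWE de dans ontspringen. It only illustrates why every MEQ hit is also an NMQ hit, and every NMQ hit is also an MLQ hit.

```python
from typing import NamedTuple


class Token(NamedTuple):
    lemma: str
    head: int  # index of the head token, -1 for the root
    rel: str   # hypothetical relation label


def dans_objects(sent):
    """Indices of 'dans' tokens that are objects of 'ontspringen'."""
    return [
        i for i, t in enumerate(sent)
        if t.lemma == "dans" and t.rel == "obj"
        and t.head >= 0 and sent[t.head].lemma == "ontspringen"
    ]


def mlq(sent):
    """Major Lemma Query: both content words occur, any configuration."""
    return {"dans", "ontspringen"} <= {t.lemma for t in sent}


def nmq(sent):
    """Near Miss Query: content words in the MWE's grammatical configuration."""
    return bool(dans_objects(sent))


def meq(sent):
    """MWE Query: as NMQ, and 'dans' has exactly the determiner 'de'."""
    for i in dans_objects(sent):
        deps = [(t.lemma, t.rel) for t in sent if t.head == i]
        if deps == [("de", "det")]:
            return True
    return False


# Lemmatised toy sentences (invented for illustration).
strict = [Token("hij", 1, "su"), Token("ontspringen", -1, "root"),
          Token("de", 3, "det"), Token("dans", 1, "obj")]
variant = [Token("hij", 1, "su"), Token("ontspringen", -1, "root"),
           Token("die", 3, "det"), Token("dans", 1, "obj")]
loose = [Token("dans", 1, "su"), Token("ontspringen", -1, "root")]

results = {name: (meq(s), nmq(s), mlq(s))
           for name, s in [("strict", strict), ("variant", variant), ("loose", loose)]}
```

In this toy, the variant with a different determiner is found by the NMQ but not the MEQ, which mirrors the kind of near miss that prompts a revision of the canonical form.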
The software demonstration presents a new lexicographic resource for Lithuanian, the Lexical Database of Lithuanian Language Usage (hereafter, the database), focusing in particular on the Collocate Search and illustrating its functionality with several examples.
The Lexical Database of Lithuanian Language Usage is the first corpus-driven lexical database for Lithuanian (Kovalevskaitė et al., 2021). The material for the database was collected from the written part (620,000 words) of the morphologically annotated and CEFR-graded Lithuanian Pedagogic Corpus. The corpus was developed for learning purposes and consists of texts collected from the Lithuanian language coursebooks and a variety of authentic learner-relevant Lithuanian data (Kovalevskaitė et al., 2020, p. 246).
For the description of word usage, the inductive procedure of Corpus Pattern Analysis (Hanks, 2013) was adopted, which was partly automated using the Lithuanian Sketch Grammar in Sketch Engine (see Kovalevskaitė et al., 2020). Although the corpus we used is rather small, the usage information on the lexical and grammatical patterns was collected for frequent words (frequency of 100 and above) from the core vocabulary, i.e., words that appeared in all CEFR levels (from A1 to B2) or at least in three levels (Kovalevskaitė et al., 2020, p. 247). The final headword list of 3700 items includes approx. 700 high-frequency words (nouns, verbs, adjectives, adverbs), as well as word formations and multi-word expressions from the core vocabulary related to these frequent words.
In the database, usage patterns are associated with specific meanings of the headword (e.g., the 3rd meaning of the headword BĖGTI (‘to run’) is represented by one usage pattern, 3.1; see Figure 1). After selecting a specific pattern, a three-colour table is displayed, in which the individual columns represent the grammatical (marked in blue), semantic (pink), and lexical levels (purple) of the pattern. As for the 3rd meaning of the verb bėgti, the information at the grammatical level shows that this verb in the present tense form (marked as BĖGTI_prs) is typically used with the subject (denoting an agent) and the adverbial of manner (marked as Adv). At the semantic and lexical level, we can additionally learn that the agent in this model is abstract (usually expressed by a collocate laikas ‘time’), and the adverbial of manner (‘būdas’) is expressed by the collocate greitai (‘quickly’). The multilevel representation of a pattern contains a lot of usage data; from the point of view of user-friendliness, however, this representation still needs better solutions.
Information at the lexical level of usage patterns provides collocates, which are defined as words commonly used with the headword. Due to the small corpus size, we do not evaluate collocates by statistical significance: a word is considered to be a collocate of the headword if they co-occur 3 or more times in the subcorpus. In the database, there is a special Collocate Search function where the user can find words for which the search word is a common collocate. For example, the search results for the verb sveikinti (‘to congratulate’) will display the noun proga (‘occasion’). The full record of the noun proga (‘occasion’) contains one meaning with 8 patterns (accessible via Headword Search); however, the Collocate Search will return pattern 1.1, where the searched collocate sveikinti (‘to congratulate’) is used (Figure 2).
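The frequency threshold and the reverse lookup behind a Collocate Search can be sketched as follows. This is a minimal illustration, not the database's implementation: it assumes co-occurrences are available as (headword, word) pairs, and the counts below are invented. Only the threshold of 3 comes from the description above.

```python
from collections import Counter, defaultdict

MIN_COOCCURRENCE = 3  # threshold stated in the text


def collocates(pairs, min_count=MIN_COOCCURRENCE):
    """Map each headword to the words that co-occur with it often enough."""
    counts = Counter(pairs)
    by_headword = defaultdict(set)
    for (headword, word), n in counts.items():
        if n >= min_count:
            by_headword[headword].add(word)
    return by_headword


def collocate_search(by_headword, query):
    """Return the headwords for which `query` is a collocate."""
    return sorted(h for h, words in by_headword.items() if query in words)


# Invented co-occurrence observations, for illustration only.
pairs = (
    [("proga", "sveikinti")] * 3
    + [("proga", "laimė")] * 2
    + [("nešti", "laimė")] * 4
)
index = collocates(pairs)
hits_sveikinti = collocate_search(index, "sveikinti")
hits_laime = collocate_search(index, "laimė")
```

With these toy counts, searching for sveikinti returns proga, while laimė is returned only for nešti because its two co-occurrences with proga fall below the threshold.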
The expanded information of pattern 1.1 shows (see Figure 3) that when the noun proga (‘occasion’) in the singular instrumental case is used with an attribute that does not agree with the noun (gimtadienio proga, ‘on one’s birthday’), this phrase refers to a [reason] (‘priežastis’) on the semantic level. Thus, a complete pattern is: sveikinti gimtadienio proga (‘to congratulate someone on their birthday’).
If a headword is polysemous, then the user will see it from the particular usage pattern, e.g., the search noun laimė (‘happiness’) is a collocate in the 2.1. pattern of the headword NEŠTI (‘to carry’), when this verb is used in its 2nd meaning, e.g., neša laimę (‘brings happiness’). The grammatical level indicates that this verb is used with an object in the accusative.
At present, the presentation of collocations in the database is useful mainly for the decoding function. However, to enable foreign learners to use semantically transparent collocations productively, it is also important to present them as units (Siepmann, 2008). In improving the resources for learning Lithuanian collocations, some other features may be worth taking into consideration, e.g., the possibility to explore connectivity between collocates at various levels of collocation networks (e.g., Brezina et al., 2015) and links to language proficiency level.
This software demonstration presents approaches to employing Wikibase in a university course on Terminology, and results of terminology projects led by students. Wikibase, an extension of MediaWiki, is the software that underlies Wikidata, a very large crowdsourced queryable Knowledge Graph. We use our own Wikibase instance for cloud-based collaborative student projects on Terminology in Basque, a European minority language spoken in Spain and France, and co-official in some regions of Spain. We show that the course work with and about this software covers a great deal of what a course in Terminology for students in translator and interpreter training should cover; it embraces relevant facets of knowledge engineering, and the acquisition of the features of a graph database and its user interface. The datasets created in student projects and partially shared through Wikidata are examples of a low-barrier crowdsourcing terminology workflow, eventually useful in other contexts as well.
The Czecho(-)Slovak Word of the Week was a joint year-long popularization project of the Institute of the Czech National Corpus and the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences, which was inaugurated on the occasion of the 30th anniversary of the dissolution of Czechoslovakia (January 1, 1993). Throughout the year, each week, a new entry written in parallel in Czech and Slovak was published on the project website. We intended to draw the attention of both the Czech and the Slovak public (especially the younger generation, for whom the former mutual intelligibility between the two languages no longer holds) to the interesting parallels, but chiefly the differences, between our two languages. We tried to do so in a user-friendly and entertaining way, the central part of each entry being a language feuilleton (a very popular genre both in Czechia and Slovakia), supplemented with data drawn from language corpora (SYN2015, SYN2020, and ORAL v1 for Czech; prim-10.0-public-all and s-hovor-7.0 for Slovak) and the respective entries from some (mostly older) monolingual and bilingual dictionaries (Bernolák 1825, Jungmann 1835–1839, SSJČ 1960–1971, SSJ 1959–1968, KSSJ 2003, ČSS 1981, SČS 1967). In a way, we see our website as a dictionary with a fixed macrostructure (52 entries, including some multi-word units) and a microstructure determined by the order of the individual components. The target audience is presented with various lexicographic information – be it frequency statistics for various text types, examples from both written and spoken corpora, or quotes from older dictionaries – unobtrusively, covertly, and usually without them having the feeling that they are “leafing through” a dictionary.
Our demo presentation is focused mainly on various non-technical aspects of the project, such as project team setup, workflow, promotion, responses from the public, etc. Our team consisted primarily of external writers of feuilletons, mostly linguists. Their texts were edited, proofread, and supplemented with information from corpora and dictionaries so that each entry had the same microstructure. In addition, a programmer and a graphic designer were necessary to implement the project successfully. Altogether, there were more than 30 people involved in the project. The feuilletons were also regularly printed by prominent Czech (Deník N) and Slovak (Denník N) dailies and featured on their websites, disseminating our effort among thousands of extra readers.
The project website’s access logs indicate a decrease in traffic after the project was officially completed in December 2023. The “Unique visitors” value (based on the IP address) displayed by the log visualization program is more or less irrelevant here, as commercial Internet providers in both the Czech Republic and Slovakia usually assign the same IP address to many different users. Nonetheless, several thousand hits peaking on the day of the weekly publication can be observed. These statistics, however, do not include access via the respective portals of both newspapers.
The positive reactions of the readers convince us of the meaningfulness of our work and of the value of a possible extension of the project. At least three possible uses can be imagined: 1) another, follow-up project created by the users themselves (user-generated content supervised by professional editors); 2) other language pairs (Czech-German/Polish, Slovak-Hungarian/Polish); 3) adding further languages (e.g., those of the Visegrád area: Czech, Slovak, Polish, Hungarian). We also want to encourage our fellow lexicographers to adapt a similar project to their languages. Given the venue of this year’s Euralex, a Croatian-Serbian version is suggested (possibly extended to the BCMS phenomenon), besides other European languages, e.g., Walloon-Flemish, Finnish-Estonian, Scandinavian languages, etc., or minority languages related to majority languages (Catalan-Spanish, Basque-Spanish and/or French, Latgalian-Latvian, etc.).
Restaurant Leut, Cavtat
https://maps.app.goo.gl/9B2Fmvi7jtoDKeJg7
Good dictionary examples are hard to come by. Despite corpora growing larger and larger, lexicographers still have difficulties in finding good candidate sentences for exemplifying how the dictionary headwords are used in context. There are automatic methods available to address this time-consuming task. One such method is GDEX, a feature of the Sketch Engine tool (Kilgarriff et al., 2004), which we have been using in our lexicographic projects and for which we have developed language-specific and even project-specific configurations in recent years. GDEX works on the principle of ranking (randomly selected) corpus sentences according to heuristics defined in the configuration file. This means that it can efficiently separate the wheat from the chaff using criteria such as (preferred) sentence length, forbidden or preferred lemmas or forms, etc. However, in many cases, many sentences are left with high(est) GDEX scores, and many of those turn out to be problematic, which is an issue as often only the top X sentences per headword or sense are used for dictionary purposes.
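The general principle of heuristic ranking can be sketched in a few lines. The code below is not GDEX, and it uses neither the real GDEX configuration syntax nor its scoring formula; the penalty value, the length bounds, and the function names are assumptions chosen for illustration. It only shows how criteria such as preferred sentence length and forbidden words can order candidate sentences, and why ties among top-scoring sentences remain possible.

```python
def score(tokens, min_len=3, max_len=8, forbidden=frozenset()):
    """Toy heuristic score in [0, 1] for one tokenised sentence."""
    if any(tok in forbidden for tok in tokens):
        return 0.0  # a forbidden word disqualifies the sentence
    s = 1.0
    if not (min_len <= len(tokens) <= max_len):
        s -= 0.5  # hypothetical penalty for sentences outside the preferred length
    return s


def top_examples(sentences, k, **criteria):
    """Rank sentences by score (stable for ties) and keep the top k."""
    ranked = sorted(sentences, key=lambda toks: score(toks, **criteria), reverse=True)
    return ranked[:k]


s1 = "the dog barked at the postman".split()
s2 = "dog".split()
s3 = "the damn dog barked".split()

scores = [score(s, forbidden={"damn"}) for s in (s1, s2, s3)]
best = top_examples([s3, s2, s1], k=1, forbidden={"damn"})
```

Because many sentences can receive the same top score under such rules, the cut-off at the top k is where problematic examples can slip through.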
The recent arrival of large language models, in particular ChatGPT, has taken the (lexicographic) world by storm, with many lexicographers advocating the help of ChatGPT for tasks such as definition writing, sense division, and even the production of entire entries (Lew, 2023; de Schryver, 2023). Therefore, we wanted to explore whether ChatGPT could also be used for the task of example selection.
We prepared a task in which the same English-language ChatGPT prompt was applied to 40,000 sentences that had been extracted from corpora of Brazilian Portuguese, Dutch, Estonian, and Slovenian to prepare manually annotated datasets for teaching and learning purposes as part of the CLARIN Resource Families project (Zingano Kuhn et al., 2022). Each dataset consists of full-sentence examples for 100 lemmas (100 examples per lemma) from three different groups: a) 20 lemmas that are offensive or vulgar, b) 60 lemmas that are offensive, vulgar or sensitive in one of their meanings, and c) 20 lemmas that would typically not be considered offensive, vulgar or sensitive. The lemmas were originally selected in English and then translated into the four languages; in the selection process, certain lemmas were discarded if found problematic from the perspective of comparability, e.g., due to different levels of polysemy in target languages, several possible translations, etc. In the annotation, the examples were marked as problematic or non-problematic, with the problematic ones also annotated with the type of problem: offensive content, vulgar content, sensitive content, incorrect spelling and/or grammar, and lack of context. We also replicated the annotation with ChatGPT-4 (Kosem et al., 2024). The following prompt was used with ChatGPT-4 for the selection task.
Below are sentences for the <language> word “<word>”. Select the best 5 sentences that can be used in a general dictionary of <language>.
SENTENCES
<list of sentences>
In the first step of the evaluation, we compared the selected examples with the results of the manual annotation. Overall, the percentages of sentences selected by ChatGPT that manual annotators had deemed non-problematic were: 70% for Estonian, 54% for Slovenian, 50% for Brazilian Portuguese, and 40% for Dutch. Of the examples deemed problematic by manual annotators, most were marked for sensitive content, followed by lack of context. The ratios of problematic categories per language were similar, with some exceptions (e.g., 50% or more problematic examples were marked for sensitive content for all the languages except for Brazilian Portuguese). The analysis across groups of lemmas showed that, in general, the number of problematic examples per lemma drops from group a to group c.
Next, we compared the selected examples with the results of annotation by ChatGPT with the same categories as used for manual annotation. The percentages of examples selected by ChatGPT that were also deemed non-problematic by ChatGPT were 39% for Estonian, 36% for Slovenian and Brazilian Portuguese, and 33% for Dutch. In terms of categories, sensitive content and lack of context were again most common, but there were also many examples marked by ChatGPT as containing offensive content (30% of problematic examples for Estonian, 26% for Dutch, and 25% for Slovenian and Brazilian Portuguese).
While these results may seem discouraging, it should be noted that the original annotation was focused more on the pedagogical value of examples than on their lexicographic value. To illustrate: while language teachers may prefer to exclude all offensive and vulgar words from their materials, lexicographers have to find good examples for all types of vocabulary. Thus, we are now conducting a detailed lexicographic evaluation of the examples selected by ChatGPT. We will reannotate all the 500 examples to determine whether they are suitable for dictionary purposes. In addition, we will look at all the examples of each lemma to determine whether more suitable dictionary examples can be found, and also to see whether there are even five good example candidates in the dataset. The results of these analyses will be presented at the conference, where we will also provide our conclusions and recommendations on using ChatGPT for dictionary example selection.
Following the release of ChatGPT at the end of 2022, the past year was largely dominated by the advent of Large Language Models (LLMs) and their potential benefits and risks. The focus of the lexicographic community, for example at the Asialex and eLex conferences, was set mainly on whether and how LLMs can help (or replace) dictionary compilation and the threats they pose to the future of lexicography (e.g., de Schryver, 2023; Jakubíček & Rundell, 2023; Nichols, 2023; McKean & Fitzgerald, forthcoming; Rundell, forthcoming). On the other hand, relatively little attention was devoted to whether or how lexicography can contribute to enhancing LLMs (e.g., Kernerman, Asialex 2023). This paper will overview both trends and propose a more active role for the latter, i.e., the lexicographic enhancement of LLMs.
The research carried out by the first group of authors mentioned above has mostly revolved around testing LLM performance on typical lexicographic tasks and evaluating the results in comparison with professional human input. These experiments included the suggestion of headwords, multiword expressions and inflected forms, sense disambiguation, creation (and recreation) of definitions, generation of usage examples, provision of citations, labels and pronunciation, etc. The overall conclusion shared by those authors has been that the (current) quality of the output was “not up to the standard of human editorial work”, as summarized by McKean & Fitzgerald (forthcoming), who also pointed out the flaws and weaknesses of LLMs in general, and recommended implementing a research program for “developing and testing prompts for common tasks, build an evaluation set to judge outcomes (which might also be useful to judge the output of human editors), and test new models and tools as they become available.”
However, the general concerns about LLMs cover a wide range of serious ethical, technical, legal, environmental, and economic issues. To begin with, most of the massive amounts of data needed for language model training stem from web-crawled corpora that are often afflicted by diverse flaws, including inconsistency, bias, “noise” that needs to be cleaned, and missing information about the source and usage license. (Multilingual LLMs, in particular, also risk applying data that has been generated automatically by machine translation engines and was not post-edited properly.) For languages other than English and the other major ones, relevant data might be scarce. The training process itself is very costly in terms of experts, time and GPU consumption (which also leads to worse computational pollution). The models require comprehensive post-editing and fine-tuning, and the outcomes are still liable to suffer from “hallucinations” that amount to deceptive fluency. Thus, the results usually cannot be used for professional purposes because of unreliable performance and potential copyright infringement.
On the other hand, quality lexicographic resources feature systemic, high-value and trustworthy linguistic data that can substantially enhance language model development, injecting greater precision, efficiency and reliability, reducing the needs for masses of data and substantial post-editing, and contributing to quality evaluation as well as to increased savings on all fronts. Detailed components and aspects such as sense division, multiword expressions, definitions, examples of usage, register and domain classification, syntactic patterns, semantic labels, grammatical categorization, and others, depict language meticulously and faithfully, and accurate translation equivalents empower multilinguality and cross-lingual linkage.
For example, the principal concepts introduced by the English learner’s dictionary pioneers in the 1930s (cf. Cowie, 1999), and by their second wave since the mid-1980s (cf. Hanks, 2012; Adamska-Sałaciak & Kernerman, 2016), embody many fundamental linguistic insights that are otherwise pursued with the assistance of enormous amounts of problematic data, e.g., close attention to phraseology (multiword expressions), examples of usage illustrating typical linguistic patterns, word sense disambiguation with senses ordered by importance and frequency rather than chronologically, indication of related domains, register, synonyms, antonyms, named entities, etc., as well as corpus-based analysis. Such “ready-made” elements offer invaluable support for diverse LLM training, fine-tuning and benchmarking tasks.
Lexicography can therefore play a vital role in reinforcing both LLM performance and users’ trust in LLMs. We will demonstrate how lexicography in the service of LLMs can complement LLMs for lexicography.
Online dictionaries are being superseded by web search engines and chatbots. There is already a body of literature on whether the latter can assist (or replace) lexicographers (e.g., Barrett, 2023; De Schryver, 2023; Lew, 2023; Jakubíček & Rundell, 2023; Rundell, 2023). However, it is not known if these (so far) alternative reference sources help language learners in meaning comprehension and retention as much as dedicated reference works.
The study aims to determine if the source of reference affects meaning comprehension and retention. Five online sources are considered: Microsoft Copilot (powered by GPT-4), ChatGPT-3.5, Google, LDOCE and COBUILD.
The following research questions are asked:
1. Does meaning comprehension depend on whether ChatGPT-3.5, Copilot, Google, LDOCE or COBUILD is consulted?
2. Does the source of reference affect immediate and delayed retention of meaning?
An online experiment was based on 25 difficult words from De Schryver (2023). Prior to the study, 21 learners of English (B2) read the paper and took down the words they did not know, of which the 25 most frequently listed were selected.
The experiment consisted of a pre-test, a main test and two retention tests (immediate and delayed). In the pre-test and both post-tests, participants provided L1 equivalents of the target words relying on their knowledge. The pre-test showed whether they knew the words before the study. The post-tests checked meaning retention immediately after the main test and two weeks later. In the main test, the subjects supplied L1 equivalents after reference to the five aforementioned sources. A counterbalanced design was employed; each set of five words was paired with a different source. Source assignment was rotated across the words in five test versions (A–E). 109 upper-intermediate foreign learners of English (B2 in CEFR) participated in the experiment (test version A–21 students, B–23, C–20, D–22, E–23).
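The rotation described above can be illustrated with a minimal sketch. It assumes a simple Latin-square scheme in which the 25 words are split into five consecutive blocks of five; the abstract does not state how words were grouped, and the word labels and the function name `assign_sources` are placeholders introduced here for illustration only.

```python
# Hypothetical sketch of a counterbalanced (Latin-square) source assignment.
# Assumption: words are grouped into consecutive blocks of five; the real
# grouping used in the study is not given in the abstract.
SOURCES = ["ChatGPT-3.5", "Copilot", "Google", "LDOCE", "COBUILD"]
VERSIONS = ["A", "B", "C", "D", "E"]


def assign_sources(words, sources=SOURCES, versions=VERSIONS):
    """Return {version: {word: source}} with sources rotated across versions."""
    block_size = len(words) // len(sources)  # 25 words / 5 sources = 5 per block
    plan = {}
    for v_idx, version in enumerate(versions):
        plan[version] = {
            # Shift the block-to-source pairing by one position per version.
            word: sources[(w_idx // block_size + v_idx) % len(sources)]
            for w_idx, word in enumerate(words)
        }
    return plan


# Placeholder labels standing in for the 25 target words.
words = [f"word{i:02d}" for i in range(1, 26)]
plan = assign_sources(words)
```

Under this scheme every test version pairs each source with exactly five words, and every word is paired with each of the five sources once across versions A–E, so word difficulty is not confounded with source.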
The results of a repeated-measures MANOVA indicate a statistically significant role of source (Wilks’s lambda=0.01, F=106.64, p<0.001, partial η2=0.784). To explore it, repeated-measures MANOVAs were conducted for each dependent variable. Significant MANOVA results were further analyzed with the Bonferroni test.
Meaning comprehension depended on the source of reference (Wilks’s lambda=0.12, F=39.23, p<0.001, partial η2=0.414). Comprehension proved the most (and comparably) successful when ChatGPT-3.5 (71.02%) and COBUILD (73.99%) were consulted (p=1.00); it was much more successful for these two sources than the remaining three (p<0.01). Copilot (61%) had a significant advantage over Google (53.66%, p=0.01) and LDOCE (53.48%, p=0.00). The latter two brought about comparably low comprehension (p=1.00).
The source referred to affected immediate retention (Wilks’s lambda=0.06, F=80.63, p<0.001, partial η2=0.503). ChatGPT-3.5 (52.64%) and COBUILD (50.96%) helped to retain meaning most successfully (p=1.00); they were significantly more useful for immediate retention than the other sources (p<0.01). Over two fifths (42.68%) of meaning explanations were remembered when Copilot had been consulted. Significantly less (p<0.00), about one third of the semantic content, was retained when Google (31.84%) and LDOCE (32.55%) had been used, with no difference between them (p=1.00).
Delayed retention depended on the consulted source, too (Wilks’s lambda=0.10, F=46.92, p<0.001, partial η2=0.437). COBUILD (32.10%) and LDOCE (29.13%) helped to retain the meaning of about one third of the words two weeks after exposure (p=0.07). Delayed retention after Google search (28.45%) was as successful as after reference to LDOCE (p=1.00), but significantly less successful than among COBUILD users (p=0.01). The least was remembered when ChatGPT-3.5 (21.43%) and Copilot (18.88%) had been consulted (p=0.19).
The study gives affirmative answers to both research questions; the source of reference affects meaning comprehension (Q1) as well as immediate and delayed retention (Q2).
In order to understand meaning and remember it immediately afterwards, COBUILD and ChatGPT-3.5 are the most recommendable, followed by Copilot. Google and LDOCE are the least helpful. However, in the long run, meaning is remembered best when COBUILD and LDOCE are consulted, and the latter is as useful as Google. Both chatbots assist delayed retention the least.
The paper reports a pilot study on the detection of lexical semantic variation in modern Swedish. The starting point of the study is the meaning descriptions of around 65,000 headwords in ‘The Contemporary Dictionary of the Swedish Academy’ (SO, 2021) covering approximately 100,000 different senses. In our work, we aim to explore the potential of the latest computational methods to discover outdated definitions in SO and update them. For this, we make use of the DURel tool (Schlechtweg et al., 2018, 2024) which relies on state-of-the-art language models for the automatic semantic analysis of word usages. The work resulted in drawing lexicographers’ attention to both main senses and subsenses that should be added to the dictionary. It has also demonstrated that certain meaning descriptions in SO are too general and should be split in accordance with the current principles for the semantic descriptions in the dictionary.
This paper is about morphology and semantics, and about how their interaction is reflected in the choices made by lexicographers about dictionary entries. The paper discusses run-ons, zero-affix morphology, and the relationship between variant forms and lexical ambiguity. We compare run-on entries with variants that are headwords across different dictionaries. We found high agreement about the variants that were attested as headwords, and low agreement about the variants that were attested as run-ons. Corpus data showed differences between run-ons and headwords as well. We also compared a sample of variants that have an explicit affix with those that do not. We found many similarities with regard to the lexical semantic relationships that are involved. The paper concludes with a discussion of criteria for when derivational variants should be listed in a dictionary, and the opportunity presented by an electronic dictionary to teach a user about morphology and meaning.
This communication aims at discussing how syntagmatic constraints in the lexicon can be provided in lexicographic resources more effectively than has been done to date, covering a wide range of multi-word expressions: from compounds to collocations and phrasemes. Examples are taken from the ongoing implementation of a multilingual specialised resource called ALMA – Multimedia Linguistic Atlas of Bio/cultural Food Diversity (Caruso et al., in press), but the proposed microstructural organisation can also be applied to general language dictionaries.
Multiword expressions and fixed syntagmatic units of the lexicon are key to fluent writing and speaking, and despite the prompt support that could be provided by writing assistants when writing a text, learning lexical constraints remains paramount for real-time interactions. Traditionally, dictionaries assisting in writing have provided synonyms and opposites to help find the most appropriate word to express the writer’s ideas. In recent years, on the other hand, there has been a tendency to offer alphabetical lists of the phraseological units associated with the lemma next to or below the article (e.g., CA; COBUILD; De Mauro; LDOCE; OELD). Nevertheless, a more comprehensive representation of the semantic domain may prove beneficial in assisting users in formulating statements on specific topics, and an onomasiological arrangement of relevant syntagmatic units may be instrumental in achieving this goal.
For the microstructure of ALMA, principles for sketching an intuitive ontological organisation of lexicographic data have been derived from Pustejovsky’s Qualia (Pustejovsky, 1991; 1995; Pustejovsky & Jezek, 2008; Pustejovsky & Rumshisky, 2008; Pustejovsky et al., 2014). Qualia, the plural form of the Latin interrogative pronoun “quale” (or ‘what’), capture the most salient features of entities denoted by words, positing that human knowledge of objects stems from answering four essential questions about the entity’s (i) class and domain, (ii) constitutive parts, (iii) purpose and function, and (iv) origin:
[i.] Formal quale: What kind of thing is it, what is its nature?
[ii.] Constitutive quale: What is it made of, what are its constituents?
[iii.] Telic quale: What is it for, how does it function?
[iv.] Agentive quale: How did it come into being, what brought it about?
(Pustejovsky & Jezek, 2016).
For example, ‘bread’ has its Origin in ‘kneading’ and ‘being baked’ (see knead and bake in the article, Figure 1), while different actions can be performed to add Constitutive parts, or condiments or other food, such as dip, smear, top, as illustrated in the article example: smear the bread lavishly with softened butter. In ALMA, the non-technical terms listed above are employed instead of Pustejovsky’s semantic terminology.
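As a rough illustration of how such an entry could be organised, the following sketch groups the syntagmatic units mentioned in this abstract under the four non-technical section labels. The class `QualiaEntry`, its field names and the method `sections_of` are hypothetical and are not taken from the ALMA implementation; the Purpose and function section is left empty because the abstract gives no example for it.

```python
from dataclasses import dataclass, field


@dataclass
class QualiaEntry:
    """Hypothetical entry layout: one list of units per quale."""
    headword: str
    class_and_domain: list[str] = field(default_factory=list)      # Formal quale
    constitutive_parts: list[str] = field(default_factory=list)    # Constitutive quale
    purpose_and_function: list[str] = field(default_factory=list)  # Telic quale
    origin: list[str] = field(default_factory=list)                # Agentive quale

    def sections(self) -> dict[str, list[str]]:
        """Map the user-facing section labels to their units."""
        return {
            "Class and domain": self.class_and_domain,
            "Constitutive parts": self.constitutive_parts,
            "Purpose and function": self.purpose_and_function,
            "Origin": self.origin,
        }

    def sections_of(self, unit: str) -> list[str]:
        """Return the labels of all sections that list the given unit."""
        return [label for label, units in self.sections().items() if unit in units]


# Units and their placement follow the examples given in the abstract.
bread = QualiaEntry(
    headword="bread",
    class_and_domain=["pão dormido"],
    constitutive_parts=["dip", "smear", "top"],
    origin=["knead", "bake", "white bread", "fresh bread"],
)
```

A lookup such as `bread.sections_of("fresh bread")` then returns the section in which a learner would find that collocation, which mirrors the onomasiological access the article layout is meant to support.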
The syntagmatic units’ arrangement in the microstructure of Figure 1 guides users to find collocates or related words for speaking or writing about bread. For instance, white bread refers to the wheat flour used to bake this type of food, which has a characteristic white colour. The compound therefore appears in the Origin section along with the collocation fresh bread, meaning ‘a bread that has just been baked’. Other collocations, such as the Portuguese pão dormido (lit. ‘sleeping bread’), which means ‘stale bread’, stand in an ‘is a’ relation with bread, reflecting a Class and domain meaning which is listed accordingly in the article search-zones (Gouws, 2014).
Pustejovsky’s Qualia also facilitate the explicit portrayal of cultural information encoded in the lexicon, enabling cross-linguistic comparisons in metaphor-rich domains like food. For instance, ‘kneading bread’ is lexicalized in Spanish as amasar or Portuguese as amassar, meaning ‘compacting the ingredients’, whereas Italian uses the denominal verb, impastare, derived from impasto, or ‘dough’. Similarly, in English and Spanish, making bread and hacer pan have synonyms that refer to the instrument used for cooking, such as baking bread [1] and hornear el pan [2]:
[1] Do you bake your own bread?
[2] Muchos italianos han comenzado a hornear sus propios panes para recortar gastos.
The above are figurative units having “an image component […or…] a specific conceptual structure mediating between the lexical structure and the actual meaning. Hence, the content plane […] not only consists of a pure ‘meaning’, i. e. actual sense denoting an entity in the world, but also includes traces of the literal reading underlying the actual meaning” (Dobrovol’skij & Piirainen, 2021, p. 14). The actual meaning and literal reading are formalised according to the Qualia Structure in the computational lexicon of ALMA, using the external ontology of the SIMPLE model (Lenci et al., 2000). To represent the literal reading, syntagmatic units receive a second-level annotation, describing the semantic relationship existing among their elements. For example, pão and dormido are in an “agentive” relationship, as the act of sleeping alludes to the time it takes for bread to become ‘stale’. The annotation of individual components in the multiword is formalised using the OntoLex-Lemon model (McCrae et al., 2017) in addition to the SIMPLE framework.
Such formal structuring allows sophisticated access to the data through tailored queries, facilitating deeper insights into language structure and usage patterns, while also improving the representation of cognitive mechanisms behind phraseme formation.
This research therefore aims to provide more valuable and comprehensible data for ‘training’ human learners and more reliable microstructural templates for machine-readable dictionaries that can be used for various NLP tasks, particularly in machine translation systems.
This presentation outlines the development process of DICIENS, a bilingual school science dictionary (English-Spanish/Spanish-English) designed for primary education students in Spain. DICIENS marks a pioneering initiative, filling a significant gap in educational resources and pedagogical lexicography. Rooted in the theoretical framework of Frame-based Terminology (Faber 2009, 2012), this endeavor integrates robust theoretical foundations with practical application. Initially, we conducted a comparative analysis of terminological entries in monolingual school science dictionaries (in English and Spanish) aimed at primary education students (Buendía-Castro, 2024). All dictionaries analysed agreed on the inclusion of a definition, usage examples, and pictures. Of Ptasznik’s (2021) four definition types, full-sentence definitions (for nouns) and single-clause when-definitions (for verbs) were most commonly found. This foundational work aimed to craft an ‘ideal’ terminological entry tailored to the comprehension levels of young learners engaging with science instruction in bilingual school environments in Spain. In the subsequent phase, we compiled a comparable corpus comprising English and Spanish texts from prominent textbook publishers in Spain. Leveraging a terminology extractor, we meticulously selected candidate terms. Subsequently, in collaboration with the educational team that belongs to the project, we refined a selection of 200 headwords to form the core lexicon of DICIENS. Each candidate term underwent rigorous scrutiny, with Sketch Engine employed to extract definitional, contextual, and phraseological insights tailored to the intended user profile. Complemented by an accompanying image bank, ten beta entries underwent evaluation by the educational team. This iterative process facilitated the redefinition of the ‘ideal’ DICIENS terminology entry model and the integration of corpus-derived information into the lexicon. 
As shown in Figure 1, each headword, apart from including the definition, usage contexts and pictures, also includes the main verb collocations, as collocations contribute significantly to the naturalness and idiomaticity of the language. Considering that errors involving collocations are among the most common mistakes made by learners during the process of acquiring a second language, we consider that internalizing phraseological information from an early age is essential. Following the approach of Buendía-Castro, Montero & Faber (2014), who developed a semantic classification for verb collocations in the specialized resource EcoLexicon (ecolexicon.ugr.es), we argue that it is very useful to provide verb collocations categorized by meaning in DICIENS. In subsequent phases, we will establish the infrastructure for data storage and retrieval, alongside designing and implementing an online interface for user-friendly access to dictionary entries. DICIENS represents a significant advancement in educational settings, bridging the lexicographic gap in science education. By catering to the unique needs of bilingual learners, it promises to enhance pedagogical effectiveness, empower educators, engage parents, and enrich student learning experiences. The methodological framework established in the development of DICIENS holds promise for the creation of similar bilingual school science dictionaries across diverse linguistic contexts.
This study has two main objectives: firstly, to provide a concise overview of the evolution of Maltese lexicography, and secondly, to shed light on the latest advancements in the field within the local context. Following selected criteria from a dictionary review checklist, this paper reviews the most significant dictionaries in the Maltese lexicographic tradition, starting from Thezan’s dictionary, compiled in the early 17th century, and continuing with the most important ones from each subsequent century.
Diretes is a Spanish monolingual e-dictionary based on Lexical-Semantic Relations which are formalized by Lexical Functions, a formal tool explored within the Meaning-Text Theory. This dictionary consists of a relational database which aims to reflect the cognitive links of the lexicon through a network of semantic and lexical associations. Currently it contains more than 100,000 collocations and semantic relations, and includes dictionary entries for simple words, compound words, idioms and formulaic expressions. The majority of the definitions of each entry consist of two parts: a minimal definition (written with basic words) and an expanded definition (which collects world knowledge data). The structure of the entries of the dictionary for formulaic expressions, however, is more complex, as they also include a pragmatic function, a prototypical scenario, and certain other specific information. Semantic relations include synonymy, antonymy, hyponymy and other productive relations not utilized to date in other dictionaries. Lexical Functions allow the user of the dictionary to consult the entries, including idioms, through both onomasiological and semasiological approaches. This paper summarizes the purpose of the creation of Diretes and the idiosyncrasies of this dictionary.
Recent developments in the scaling and training of large language models (LLMs) have led to a dramatic change in how the public views Artificial Intelligence. No longer the vaguely aspirational preserve of science fiction stories, AI is now expected to work, and not just in the laboratory but in a wide range of consumer products. Yet as AI outperforms people on tasks that were once considered yardsticks of human intelligence, one area of human experience still holds out, for now at least: our very human sense of humour. This is not for want of trying, as this talk will show. There is good reason for computer science to take humour seriously. By building computer systems with a sense of humour, capable of appreciating the jokes of human users or even of generating jokes of their own, we can turn academic theories into practical realities that amuse, explain, provoke, and delight. The writer Clive James once pronounced that one should not trust anyone lacking a sense of humour, even, indeed, to post a letter, for what is humour but our sense of equanimity and poise in the face of the unpredictable when common sense has been pushed to the brink? My talk will describe where researchers are on this road to more humorous machines, and explore how we might go further towards giving LLMs a robustly human funny bone. The talk will also cover related issues such as acceptability and value alignment in LLMs, since humour often pushes the bounds of what is socially acceptable in polite company.
Large language models (LLMs) have attracted much attention in lexicography since the release of ChatGPT in late 2022. Several studies have explored their use in dictionary-writing tasks (e.g., Lew, 2023); others have also raised concerns about their limitations and risks (e.g., McKean and Fitzgerald, 2023); de Schryver (2023) provides a useful overview and analysis.
This paper explores how LLMs can support and enrich the Oxford English Dictionary (OED), a large historical dictionary with over half a million entries. The OED has a history of adopting pioneering technologies, such as computerization and the use of electronic text archives and corpora (Gilliver, 2016, p. 542, pp. 559–63); and in recent years it has benefited from machine-learning projects such as the semi-automated expansion of the OED’s Historical Thesaurus (McCracken, 2015). It is in this spirit that OED staff have approached possible uses of LLMs. Like other lexicographers, we have been exploring the potential of LLMs to accelerate drafting of dictionary content, but we are particularly interested in cases where LLMs could work at scale across the whole dictionary, for example generating draft definitions for undefined derivatives, modernizing unrevised definitions, and assigning illustrative quotations to senses. We present our findings from experiments with various prompts, and report especially on the importance of well-structured prompts and high-quality examples, and on the essential role of human editors in refining prompts and reviewing output. We also discuss our findings regarding the outputs of different LLMs and parameters.
We identify areas where LLMs perform relatively well (for example some types of definition and usage notes), areas where they currently perform less well (especially with historical data), and areas for future investigation. We also discuss other opportunities that LLMs could create for the OED. For example, we are currently in the early stages of planning a new, large corpus of historical English, and investigating some of the ways in which machine learning and LLMs could be used in tagging and annotating the data. We are also exploring the ways in which LLMs could transform the user experience of the OED: for example, a conversational interface powered by an LLM would allow users to search the OED using natural language queries, obviating the need for complex advanced searches and enabling novel possibilities for interacting with the dictionary or its data. If successful, such an approach might be extended to other tasks, such as the natural language querying of corpora.
There are particular challenges for the OED in all of these propositions, not least the fact that the LLMs available at the time of writing (such as GPT-4, Claude 3, and LLaMA 3) are trained on modern texts and not well suited to the analysis of historical data. While some historical language models have been developed (e.g., MacBERTh: see Manjavacas and Fonteyn, 2021), the field of historical LLMs is at a very early stage. Another challenge is the presence of third-party filters, which limit the use of LLMs in handling sensitive material. Furthermore, we share the widely-expressed concerns about errors and hallucinations generated by LLMs, and we are keen to find ways to mitigate these, for example through retrieval-augmented generation.
LLMs and other AI tools also offer unique opportunities for the OED, given its size and scope. Beyond the explorations summarized above, we discuss ways that an AI tool could carry out tasks on a scale that would not be possible for a human lexicographer: for example, identifying significant antedatings, gaps, omissions, or discrepancies across the whole text, which could help prioritize entries for revision; or identifying and visualizing patterns and connections across the dictionary. We anticipate that AI will revolutionize lexicography in a similar way as corpora have in previous decades: not by replacing but by enhancing the work of human editors.
The arrival of generative language models such as ChatGPT, developed by OpenAI, has sparked considerable interest among the general public, linguists, and lexicographers. As the roundtable discussion at eLex 2023 in Brno clearly showed, the lexicographic community is split between excitement and scepticism. Although there is growing recognition of the benefits that generative AI can bring to lexicography, the jury is still out regarding the full scope of its impact. For some, ChatGPT “does not herald ‘the end of lexicography’” (Rundell, 2023, p. 9), but for others it makes dictionaries, lexicographers and post-editing lexicographic tools redundant (de Schryver, 2023).
Initial research into the potential of ChatGPT for lexicography (Jakubíček & Rundell, 2023; McKean & Fitzgerald, 2023; Rundell, 2023) has largely relied on the authors’ own expert evaluations. Lew’s (2023) study is unique in this context as it employed a blind review process, where human experts assessed dictionary entries taken from Collins COBUILD Advanced Online and those generated by ChatGPT-3.5. The study found that the quality of definitions generated by ChatGPT was “practically indistinguishable” (Lew, 2023, p. 8) from those produced by COBUILD lexicographers. Further evidence of ChatGPT’s potential comes from dictionary user studies. The first such study, carried out by Rees & Lew (2023), used a multiple-choice reading task to compare the effectiveness of ChatGPT-generated definitions with those from the Macmillan English Dictionary (MED) in helping users understand unknown vocabulary. The study revealed that students with access to MED definitions significantly outperformed those without any definitions, yet no significant differences were noted between students using ChatGPT-generated definitions and those with MED definitions or without any definitions.
Investigating ChatGPT’s effectiveness in acting as a lexicographer and in producing dictionary entries represents one approach to exploring its potential. Yet, it is possible to imagine a post-dictionary future, where dictionaries “will at best be subsumed within, at worst gobbled up by, other digital tools” (de Schryver, 2023, p. 380) and traditional ways of presenting lexical knowledge in dictionary-like format will no longer be needed (de Schryver, 2023, p. 380). In such a future, users may no longer have to struggle with the various challenges inherent in consulting existing dictionaries such as finding, interpreting and applying information. Although dictionaries are unlikely to soon disappear completely, there may come a point when they are outperformed by large language models (LLMs) in providing language assistance. Given this possibility, it is worth investigating ChatGPT’s capacity to act not as a lexicographer producing dictionary entries, but as a language support tool capable of performing various language tasks in response to user prompts.
The present study aims to contribute to the existing lexicographic literature by addressing a key question for the emerging lexicographic landscape: Can ChatGPT effectively perform language tasks that would traditionally be performed with the aid of a dictionary? A secondary aim of the study is to compare the effectiveness of three ChatGPT models: 3.5, 4 and 4o, thus providing additional insights into the usefulness of the tool. To test the effectiveness of ChatGPT vis-à-vis dictionary consultation, the present investigation draws upon 10 published user studies (see the references) involving dictionaries, where experiment participants performed specific language tasks with a dictionary’s assistance. All of the studies appeared in the International Journal of Lexicography or as monographs, with a key selection criterion being the availability of the instrument used for measuring participants’ performance. Additionally, care was taken to ensure that both productive and receptive skills were investigated. The tasks from those studies were then submitted to ChatGPT. To ensure comparability between the performance of ChatGPT and that of dictionary users in the original studies, the prompts given to ChatGPT closely mirrored the instructions provided to the participants in the original studies. However, dictionary entries used in the original experiments were not included in the prompts, the only exception being the study by Lew (2004), as the words tested there were pseudowords. The results reported in the original studies served as benchmarks against which ChatGPT’s performance was compared. The study’s findings, to be presented for the first time at the EURALEX congress, will demonstrate ChatGPT’s strong potential as an alternative to traditional dictionaries in a range of tasks typically associated with them. While its performance varies depending on the task, ChatGPT sometimes achieves a perfect score, outperforming traditional dictionary consultations. This suggests that the dawn of the post-dictionary world may soon be upon us.
In this paper, we argue that it is beneficial for a constructicon to have a query interface where the user can enter arbitrary text. To allow this, we present a dependency-tree-based model for representing constructions, and show that this model can serve as a basis for working out a user-friendly query interface which analyses the free-text user query and matches the constructicon entries to it, revealing all constructions from the query text for the user, without expecting any knowledge of construction grammar.
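The general idea of matching constructicon entries against a dependency-parsed query can be sketched as follows. This is only an illustrative toy under stated assumptions, not the authors’ model: the classes `Node` and `Pattern`, the function names, the English example and the hand-built parse are all hypothetical, and a real system would obtain the tree from a dependency parser.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One token of a (hand-built) dependency tree."""
    lemma: str
    upos: str
    deprel: str = "root"
    children: list["Node"] = field(default_factory=list)


@dataclass
class Pattern:
    """A construction as a partial tree; None means 'any value'."""
    lemma: Optional[str] = None
    upos: Optional[str] = None
    deprel: Optional[str] = None
    children: list["Pattern"] = field(default_factory=list)


def matches_at(pattern: Pattern, node: Node) -> bool:
    """True if the pattern matches with its root placed on this node."""
    for attr in ("lemma", "upos", "deprel"):
        wanted = getattr(pattern, attr)
        if wanted is not None and wanted != getattr(node, attr):
            return False
    return _assign(pattern.children, node.children)


def _assign(pats: list[Pattern], nodes: list[Node]) -> bool:
    """Match every pattern child to a distinct node child (backtracking)."""
    if not pats:
        return True
    first, rest = pats[0], pats[1:]
    return any(
        matches_at(first, n) and _assign(rest, nodes[:i] + nodes[i + 1:])
        for i, n in enumerate(nodes)
    )


def find_constructions(constructicon: dict[str, Pattern], root: Node) -> list[str]:
    """Return the names of all entries that match anywhere in the tree."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        for name, pattern in constructicon.items():
            if name not in found and matches_at(pattern, node):
                found.append(name)
        stack.extend(node.children)
    return found


# Hypothetical two-entry constructicon and a parse of "She gave up smoking".
constructicon = {
    "give up": Pattern(lemma="give", upos="VERB",
                       children=[Pattern(lemma="up", deprel="compound:prt")]),
    "take into account": Pattern(lemma="take", upos="VERB",
                                 children=[Pattern(lemma="account", deprel="obl")]),
}
query = Node("give", "VERB", children=[
    Node("she", "PRON", "nsubj"),
    Node("up", "ADP", "compound:prt"),
    Node("smoke", "VERB", "xcomp"),
])
hits = find_constructions(constructicon, query)
```

Because the pattern only constrains the nodes it mentions, the user’s text may contain any amount of additional material around the construction, which is what makes free-text querying possible without knowledge of construction grammar.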
Over the last few decades, the influence of the English language on the Latvian language has significantly increased. This is due to geopolitical events, rapidly growing media and technological advances where the global language is English. Consequently, the influx of English borrowings into Latvian has increased, raising the incidence of false friends. Notably, some long-established false friends either adopt another meaning or lose the original one, aligning more closely with their English counterpart. French has also impacted Latvian, albeit through intermediary languages such as German and Russian. In this paper, examples of false friends in English-Latvian and French-Latvian bilingual dictionaries will be examined and compared. Currently, there is only one English-Latvian dictionary of false friends, published over three decades ago; whether it needs revisions that account for recent linguistic developments remains to be determined. Contrastive lexicographic analysis is applied.
We present a study of Danish multiword constructions containing one or more hyphens, such as gas- og vandmester (‘gas- and water.repairman’; ‘plumber’), ilt- og brintatomer (‘oxygen- and hydrogen atoms’) and haveborde og –stole (‘garden tables and -chairs’). Although materially analogous, such constructions exhibit different semantics, falling – as we shall argue – into two distinct groups (“locum” vs. “pseudo locum” hyphen constructions). This study employs the COR database and draws on the CLINK formal framework for computational lexicography. The aim of this paper is to demonstrate how linguistic analysis of such multiword expressions with rich internal semantics can benefit from methods of computational lexicography – and vice versa.
Terminology within the domain of environmental economics, a rapidly growing and changing sub-discipline of economics concerned with environmental issues, has been understudied in the literature on domain-specific languages. While, on the one hand, it presents the common features of specialized vocabulary, i.e., monoreferentiality, precision, economy and objectivity (Gotti, 2008; Scarpa, 2020), it also has its own peculiarities because it lies at the intersection among different fields, not just economics and environmental science, but also law, commerce and business. Hence its hybrid and complex cognitive nature, often blending concepts imported from unrelated domains. The name itself of an emerging subfield of environmental economics, i.e., envirodevonomics, studying environmental quality in developing countries (Greenstone & Kelsey, 2015), is a case in point, as it creatively blends words referring to multiple domains of knowledge.
The aim of this presentation is twofold. First, it intends to illustrate the features of a newly created trilingual (English-Italian-Spanish) glossary of environmental economics terms which will soon be made available on Lexonomy (Méchura, 2017; Rambousek, Jakubicek, & Kosem, 2021). The termbase consists of approximately 1,000 items automatically extracted from a two-million word corpus (ENCO Corpus) compiled specifically for this research. The corpus includes academic papers published in journals of environmental economics freely accessible online and public reports from government agencies and companies, providing additional context and depth from outside the academic world. Although some lexicographic resources in the field of environmental economics already exist (e.g., Acks, 1997; Markandya et al, 2001), they are only monolingual, somewhat dated or just in printed form. In today’s mobile society, however, there is a need for easily
accessible, updatable, electronic dictionaries (Fuertes Olivera, 2018; Jackson, 2018) and multilingual termbases to be used also for translation purposes and in the context of specialized language teaching/learning. After presenting a demo of our English-Italian-Spanish glossary of environmental economics terminology and discussing the similarities and differences with IATE (Interactive Terminology for Europe), EU’s terminology
management system, we aim to examine the different term formation patterns in the three languages, with a focus on their cognitive implications. In particular, we will show how meaning construction is subject to varying conceptual structures (Faber Benítez, 2009), frames (Fillmore, 1982; 1985; Lakoff, 2004; 2010) and ideologies (Liu, Lyu, & Zheng, 2021) in the three languages under investigation. The interpretation of the compound term carbon footprint, for instance, is possible by virtue of a metaphor, whereby footprint, which activates the walking frame, is understood as the ‘impact that our actions have on the environment’ due to carbon dioxide emissions associated with human activity. Through
calquing, both Italian and Spanish employ similar metaphoric expressions, i.e., impronta di carbonio and huella de carbono, respectively. In other cases, though, the meaning construction dynamics are different in the three languages. The term phase-out, originating from the field of physics which is still indirectly evoked, in the expression coal phase-out (i.e., the gradual elimination of coal) does not have exact counterparts in Italian or Spanish: eliminazione graduale del carbone (Italian) and eliminación gradual del carbón (Spanish) are based on a paraphrase and are consequently less technical. The opposite scenario may also be observed, i.e., environmental economics terms in either Italian or Spanish sometimes exhibit a higher level of precision and a more restricted scope of application: the English term greenwashing is rather generic, as its meaning shares a conceptual link with other metaphorical uses of ‘washing’ in contexts beyond environmental concerns (e.g., whitewashing, covering up crimes and forms of corruption; pinkwashing or rainbow washing, attempting to benefit from the support of LGBTQA+
rights; sportswashing, investing in sports to promote a country’s reputation while redirecting public attention away from unethical conduct; and so forth). Its equivalents in both Spanish and Italian are instead based on more concrete metaphors: the Spanish equivalents ecoimpostura, ecoblanqueo and blanqueo verde specifically activate the (il)legal(ity) frame, while both ecologismo/ambientalismo di facciata in Italian and ecopostureo in Spanish make explicit the notion of something deceptive or inauthentic.
The ultimate goal of the present study is to provide an initial multilingual representation of terminology organization and of the recurrent patterns of knowledge modelling in the specific domain of environmental economics.
15.00 departure from Cavtat in front of Hotel Croatia*
16.00–18.00 Dubrovnik guided tour
18.00–21.00 free time in Dubrovnik
21.00 departure from Dubrovnik at Pile Gate
*please be in front of the Hotel Croatia at least 5 minutes before departure
Since 2019, the Institute of the Estonian Language (EKI) has been compiling the EKI Combined Dictionary (CombiDic). Our presentation concentrates on incorporating synonyms into the CombiDic using the dictionary writing system Ekilex (Tavast et al., 2018; Tavast et al., 2020), where we have two types of synonyms – full and partial. We acknowledge that full synonymy is a rare phenomenon within a language (see, e.g., Cruse, 1986), but in our data model, words considered full synonyms are connected to the same meaning entity (share the exact same definition) and are mostly interchangeable. Partial synonyms are represented by meaning relations, which indicate that their meanings are similar and that they can be interchangeable in certain contexts. Users can see both types of synonyms in the language portal Sõnaveeb (Koppel et al., 2019), where full synonyms are displayed prominently in bold text and partial synonyms in standard type. While full synonyms are added mainly by the team working on sense division, our team focuses on adding partial synonyms in a specially developed task-oriented interface of Ekilex, where a lexicographer is presented with the sense division of a given headword and an automatic list of synonym candidates. When two headwords share similar meanings, compilers can simply drag and drop the corresponding senses, thereby establishing a meaning relation that is displayed as a partial synonym within the interface. In Ekilex, all synonyms are bidirectional (if A = B, then B = A). Whenever full synonyms are created by connecting words to the same meaning, the system displays them as a cluster (if A = B and B = C, then A = B = C). If, e.g., a lexicographer includes A as a partial synonym to D, B and C in the cluster will be partial synonyms to D automatically. Then again, creating a meaning relation between partial synonyms only creates a bidirectional relation between two senses and does not connect other partial synonyms to the cluster.
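The cluster behaviour described above can be illustrated with a minimal sketch. It is not the Ekilex data model: the class `SynonymStore`, its method names and the meaning identifiers `m1`–`m3` are hypothetical, and each word is assumed to have a single sense for simplicity. Full synonyms share one meaning entity, while partial synonyms are symmetric relations between meanings that are not followed transitively.

```python
from collections import defaultdict


class SynonymStore:
    """Toy model: full synonyms share a meaning; partial synonyms are meaning relations."""

    def __init__(self):
        self.meaning_of = {}              # word sense -> meaning id
        self.words = defaultdict(set)     # meaning id -> word senses attached to it
        self.related = defaultdict(set)   # meaning id -> directly related meaning ids

    def attach(self, word, meaning):
        """Connect a word sense to a meaning entity (creates or extends a cluster)."""
        self.meaning_of[word] = meaning
        self.words[meaning].add(word)

    def relate(self, word_a, word_b):
        """Create a bidirectional meaning relation between the meanings of two words."""
        m_a, m_b = self.meaning_of[word_a], self.meaning_of[word_b]
        self.related[m_a].add(m_b)
        self.related[m_b].add(m_a)

    def full_synonyms(self, word):
        """All other words attached to the same meaning entity."""
        return self.words[self.meaning_of[word]] - {word}

    def partial_synonyms(self, word):
        """Words of directly related meanings only; relations are not chained."""
        result = set()
        for meaning in self.related[self.meaning_of[word]]:
            result |= self.words[meaning]
        return result


store = SynonymStore()
for w in ("A", "B", "C"):        # A = B = C: one cluster on meaning m1
    store.attach(w, "m1")
store.attach("D", "m2")
store.attach("E", "m3")
store.relate("A", "D")           # B and C become partial synonyms of D as well
store.relate("D", "E")           # E is related to D only, not to the A/B/C cluster
```

In this sketch, relating A to D makes B and C partial synonyms of D because the relation is stored on the shared meaning, whereas E stays unconnected to the cluster because only direct relations are consulted.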
Occasionally, bidirectional compilation may present both semantic as well as technical difficulties for lexicographers, including the following.
• Inconsistencies in sense division between synonymous words. Adding partial synonyms also brings other findings within the data to light, such as missing senses and inconsistencies in the sense division of similar words.
• Broader and narrower senses. This is a question that particularly concerns partial synonyms. It is inherent for partial synonyms to be synonymous in a particular aspect, but not in the scope of the whole given dictionary sense. It may be useful for lexicographers to group different usage possibilities into a single sense, but this can present a problem from the point of view of adding synonyms. For example, combining different types of movement under one sense for the verb tantsima ‘dance’ may seem economical, but adding synonyms in a single row for “dancing” of birds, animals and inanimate objects (e.g., a plastic bag in the wind) will create a long disconnected list.
• Fuzzy lines between synonymy and hyperonymy. For example, a hyponym can be partially synonymous with its hyperonym, e.g., T-särk ‘T-shirt’ = särk ‘shirt’, triiksärk ‘dress shirt’ = särk, etc. When hyperonyms are added as synonyms to each hyponym, a list of hyponyms will be shown as synonyms to the hyperonym.
• Parts-of-speech issues arise when a word has shifted out of its paradigm in a particular meaning and has acquired the characteristics of another word category. For example, null ‘zero’ (numeral/noun) in the phrase tuju on nullis ‘the mood is at zero’ can semantically be viewed as a partial synonym of adjectives like olematu ‘non-existent’ and halb ‘bad’. It might confuse the users of Sõnaveeb when there is suddenly a numeral/noun in a list of adjective synonyms.
It is difficult to present grammatical and syntactic synonymy because not all potential synonyms are traditionally headwords in a dictionary: short stems (e.g., digitaalne (adjective) = digi- (used only in compounds) ‘digital’), inflected forms (e.g., täppidega ‘with spots’ as a synonym for täpiline ‘spotted’), etc. In our presentation, we give more examples and discuss potential solutions to the semantic and technical challenges we face with bidirectional compilation of synonyms.
In our poster presentation, we will present the results of an experiment that tests the potential of large language models (LLMs) in the semantic analysis of Estonian. We will focus on LLMs’ ability to analyse polysemy and create definitions. In 2024, the Institute of the Estonian Language started a new project in which we are exploring how LLMs, such as GPT, can help with the presentation of dictionary information. The representation of polysemy in dictionaries is still one of the most difficult tasks that lexicographers are confronted with, even with the availability of modern automatic tools, e.g., Sketch Engine (Kilgarriff et al., 2014). As Adam Kilgarriff (1992) has summarised, there is no consensus among researchers on what constitutes a word sense, on how broad or narrow senses should be, or on definitive guidelines for determining where one sense ends and another begins. Lexicographers nevertheless try to describe the meanings of a word, although in the case of a very polysemous word every lexicographer would probably compile a different word sense division, even if the corpus data and the corpus tools used were the same. As the word senses in a dictionary are abstractions and generalisations, it is difficult to determine an objective gold standard for this task. We will develop a lexicography-specific methodology to evaluate the output of LLMs. First, we will compile a sample that is representative of the relevant units of the Estonian language (including words with different numbers of senses in dictionaries) and prompt LLMs with the related tasks. Experts will be included in the evaluation. LLMs have not yet been systematically included in dictionary work in Estonia. Worldwide there already exist dictionary systems, e.g., TLex (Joffe et al., 2003), that have artificial intelligence built in and which enable lexicographers to compile word articles using AI.
Since 2019, the Institute of the Estonian Language has been aggregating language data into a single dictionary and terminology database called Ekilex (Tavast et al., 2018). This open-source database would also allow for integration with AI and NLP. As a result of the project, we plan to innovate the compilation methods by utilizing advanced technology and create language datasets for lexicographers. For English, LLMs have been tested on their ability to generate different lexicographic macro- and micro-structural components (e.g., Jakubíček & Rundell, 2023). The best results have been achieved using ChatGPT for generating English definitions (de Schryver & Joffe, 2023; Lew, 2023). Therefore, this is one of the tasks we are interested in testing for Estonian as well. Some of the LLMs on the market support Estonian, e.g., GPT-3 (Brown et al., 2020) and GPT-4 (OpenAI, 2023). These models are trained on vast datasets collected from the internet, but the inclusion of Estonian texts is not a deliberate effort, rather a byproduct of the data collected. However, the effectiveness of these models in solving lexicography-related tasks and for the Estonian language in general is yet to be determined. Our work will contribute to this area by providing results for the following models: GPT-4, GPT-4o and Gemini 1.0 Ultra. The aim is to test these models without adjustments, but depending on the results, further fine-tuning might be needed in future research.
In 2023 the first edition of the diachronic-contemporary Swedish Academy Dictionary (SAOB) was finished after more than 130 years of work. The dictionary contains over 500,000 entries and describes the Swedish language from the 1520s to the present day. However, a consequence of the extended time of publication is that a significant part of the content, in particular in the first volumes, was out of date even by the time of completion of the full edition. Headwords had become obsolete, the meta-language was outdated in several regards, and many modern words were missing from the lemma list of the dictionary. To address this, a revision project was initiated at the beginning of 2024. This article reports on the revision in more detail, with emphasis on the supplement of new entries.
A medication package insert is a legal healthcare document with important information about medications. In Brazil, the National Health Surveillance Agency (ANVISA) requires two versions of the package insert: one for patients and another one for healthcare professionals. In this study, we manually evaluated the performance of an automatic frame annotator on a corpus consisting of 100 sentences targeting patients and 100 sentences targeting healthcare professionals. The aim was to evaluate whether the parser’s output evidenced correct assignment of semantic frames and their frame elements (FEs) in each input sentence and to what extent human post-annotation would be necessary to improve the output. Text target audience was defined as a variable potentially impacting frame detection given differences in the language of package inserts. Overall, the findings demonstrate the efficacy of the automated annotation process, revealing challenges that have to do with the same form being capable of classification under different categories and/or frames. Few differences were found when comparing sentences for the two distinct audiences targeted in the texts.
Working with historical documents presents many challenges, not only because some sources are not well preserved, but also because grammar and spelling rules from older times were not always consistent. Still, these texts remain a rich source of information about our history, and we could greatly benefit from the information that can be extracted from them. At the same time, the lack of spelling and grammatical consistency poses a problem for the application of computational tools, so most of the analysis work is done manually. To overcome this lack of consistency, researchers started normalising the spelling of historical documents, as this increases the performance of modern tools.
Spelling normalisation is, however, also carried out manually most of the time. In this paper, we present some experiments that were done for automatically normalising historical documents in two languages: Portuguese and Albanian. Leveraging state-of-the-art large language models that were pre-trained for translation, we used corpora that were carefully curated and manually normalised to train new computational models. These models can automatically normalise documents in these languages, achieving new state-of-the-art BLEU scores above 90 for Portuguese, and up to 59 for Albanian, beating the task baselines.
Existing research on ChatGPT in lexicography is undoubtedly valuable. However, it has tended to focus on metalexicographic concerns rather than effectiveness in resolving user queries directly. Moreover, it has mostly dealt with general-purpose English lexicography, often ignoring other languages and specific purposes. Focussing on 33 L1 Spanish users completing an introductory training course on the use of the Python programming language for linguistic research at a Spanish university, this study attempts to fill these gaps. Participants responded to ten multiple-choice questions designed to test understanding of basic programming terms. Approximately half received explanations from a respected introductory Python textbook written in Spanish. The remainder received ChatGPT 3.5-produced explanations written in Spanish. GPT-generated explanations offer performance advantages while textbook definitions offer advantages in processing time. In follow-up interviews, several participants reported feeling overwhelmed by the quantity of explanation provided by ChatGPT.
Corpus Pattern Analysis (CPA) is a technique for identifying local semantic and syntactic information about a word and mapping it to its meanings. For verbs, a pattern consists basically of the argument structure labelled with semantic types for each argument. CPA is used in several dictionary projects and allows systematic corpus analysis; however, it is extremely time-consuming. In this paper, we present a method for automatic pattern identification of Spanish verbs in corpora. We used a syntactic parser for dependency analysis (Stanza), applied a NER tagger from the Flair NLP framework for named entity recognition, and for common nouns, we implemented a semantic tagger and a word sense disambiguation method, both created for the task. All resources were combined to extract CPA verb patterns. The method performs better than previous attempts and can contribute to a
more efficient pattern-based lexicography.
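To make the idea of a verb pattern as an argument structure labelled with semantic types more concrete, the following is a minimal sketch only. It does not call Stanza or Flair and does not reproduce the authors' taggers: the parsed clauses in `CLAUSES` and the lexicon `SEMANTIC_TYPES` are hand-made, hypothetical stand-ins for the output of the parsing, named entity recognition and sense disambiguation steps, and the double-bracket notation for semantic types is an assumption of this example.

```python
from collections import Counter

# Hypothetical pre-parsed clauses: verb lemma plus the lemmas of its arguments,
# keyed by dependency relation. A real pipeline would obtain these from a parser.
CLAUSES = [
    {"verb": "beber", "nsubj": "niño", "obj": "agua"},
    {"verb": "beber", "nsubj": "María", "obj": "vino"},
    {"verb": "beber", "nsubj": "perro", "obj": "agua"},
]

# Hypothetical semantic-type lexicon standing in for the NER and semantic taggers.
SEMANTIC_TYPES = {
    "niño": "Human",
    "María": "Human",
    "perro": "Animal",
    "agua": "Beverage",
    "vino": "Beverage",
}


def semantic_type(lemma):
    """Return the semantic type of a lemma, or a catch-all label if unknown."""
    return SEMANTIC_TYPES.get(lemma, "Anything")


def clause_pattern(clause):
    """Replace each argument of the verb with its semantic type."""
    parts = []
    if "nsubj" in clause:
        parts.append("[[" + semantic_type(clause["nsubj"]) + "]]")
    parts.append(clause["verb"])
    if "obj" in clause:
        parts.append("[[" + semantic_type(clause["obj"]) + "]]")
    return " ".join(parts)


def count_patterns(clauses):
    """Count how often each typed pattern occurs across the clauses."""
    return Counter(clause_pattern(c) for c in clauses)


patterns = count_patterns(CLAUSES)
```

Ranking the resulting counts gives a lexicographer the most frequent typed frames for a verb first, which is where manual CPA work would otherwise start.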
The paper details the current state of an ongoing collaboration between Hungarian lexicographers and computational linguists. Our goal is to provide a comprehensive and consistent description of Hungarian adjectives, benefiting lexical semantics, lexicography and NLP. This thread of research focuses on identifying systematic semantic patterns of Hungarian adjectives and their typical subcategorization frames, with a particular emphasis on polysemous meanings. The proposed methodology is entirely unsupervised, reducing reliance on human intuition. It is based on a graph representation derived from adjectival static embeddings. The algorithm models adjectival semantic domains by specific subgraphs, namely, connected graph components. In the next step, potential
subcategorization frames for the detected adjectival semantic domains, so called meaning structures, are also derived from corpus data. Then, a sample of the meaning structures is compared to the entries of the Concise Dictionary of Hungarian, evaluating the pros and cons of the proposed algorithm. Finally, as a further improvement, the automatically derived subcategorization frames were generalized.
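The graph step can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the abstract only says that semantic domains are modelled as connected components of a graph derived from static embeddings, so the edge criterion used here (cosine similarity above a fixed threshold), the threshold value and the three-dimensional toy vectors in `VECTORS` are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_graph(vectors, threshold):
    """Link two adjectives when their embeddings are similar enough (assumed criterion)."""
    graph = {word: set() for word in vectors}
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Collect the connected components of an undirected graph by depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Toy embeddings, invented for illustration only.
VECTORS = {
    "piros": (0.9, 0.1, 0.0),     # 'red'
    "kék": (0.8, 0.2, 0.1),       # 'blue'
    "zöld": (0.85, 0.15, 0.05),   # 'green'
    "gyors": (0.1, 0.9, 0.2),     # 'fast'
    "lassú": (0.0, 0.8, 0.3),     # 'slow'
}

components = connected_components(build_graph(VECTORS, threshold=0.9))
```

With these toy vectors the colour adjectives and the speed adjectives end up in separate components, which is the kind of grouping the abstract refers to as an adjectival semantic domain.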
A common issue in Corpus Linguistics is assessing representativeness and balance of a corpus (McEnery & Hardie, 2011). Biber (1993, p. 244) defines representativeness as “the extent to which a sample includes the full range of variability in a population.” Assessment has been traditionally tackled quantitatively and qualitatively both in monolingual and bilingual settings (Stefanowitsch, 2020). However, the issue is often overlooked and far from being solved, especially when one is working with specialized multilingual lexicography, where cultural and linguistic differences can add an extra layer of difficulty in
assessing how representative and balanced a multilingual comparable corpus is.
The global status and presence of different languages on and off the Internet add a further complication. When it comes to balance, which has been traditionally associated with proportionality (Leech, 2007), more difficulties arise. Proportionality can be a rather tricky concept as it can refer to multiple aspects of a corpus: same genres in all corpora, same number of tokens or texts, same sources. In all of these dimensions, it is assumed that there is an equal availability of textual typologies and similar status across domains in all languages and cultures covered by the subcorpora, which does not necessarily hold true. The issue can be even more pressing when we are dealing with highly interdisciplinary domains, such as migration and asylum, in which differences in national legal systems, migratory flux, and funding can shape the production and availability of supporting materials for asylum-seeking officers and claimants, for example. From that, another challenge arises: determining the authenticity of texts, especially when they are produced in multilingual international institutions, where determining the authorship of texts and their translation(s) can be difficult. When one opts to build a corpus with material available on the Internet (Seghiri, 2011), another issue that arises is the presence of that language on the web and how this is reflective, and therefore representative, of the offline world. Another complicating factor is the power balance among languages of the world, i.e., English dominates the Internet and has more financial power to foster more written
language production, while other languages are unable to compete (Prado, 2012).
In the field of migration, the use of incorrect terminology can lead to misunderstanding and even culminate in refusal of entry. Therefore, encoding information such as regional variation in both the corpus and the lexicographic materials is not only desirable, but necessary to represent the domain properly. In this study, we discuss representativeness and balance of the Multilingual Corpus on Migration and Asylum (COMMIRE) (Furtado & Teixeira, 2022), a specialized comparable corpus of Portuguese, French, Spanish, and English with over 1 million words in each language. COMMIRE is a multi-genre, multilingual, and multi-variety corpus whose primary goal is to serve as the basis for the
Multilingual Glossary on Migration and Asylum (Furtado, 2019).
Traditionally, representativeness in specialized language corpora has been tackled qualitatively by collecting the most reputable and crucial texts belonging to a domain by following the opinion of experts or existing knowledge taxonomies. Although efficient, the method is costly and, when dealing with multiple languages, the experts are not always easily identifiable or accessible. An alternative has been to go above and beyond collecting as much text as possible, making it hard to assess when enough is enough, and adding increasing challenges for cleaning, preprocessing, storing, and analyzing the corpus in its full extent. To mitigate this issue, automatic metrics, for example, the type-token ratio (TTR), have been proposed to evaluate the balance of corpora (Seghiri, 2014).
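As a reminder of what the type-token ratio measures, a minimal sketch follows. The tokenizer (lowercasing and splitting on word characters) and the sample sentence are assumptions made for illustration; they are not the preprocessing used for COMMIRE.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR for an empty token list")
    return len(set(tokens)) / len(tokens)


sample = "the officer reviews the claim and the claimant reviews the decision"
tokens = tokenize(sample)
ttr = type_token_ratio(tokens)  # 7 types over 11 tokens
```

Because TTR falls as a text grows longer, values are only comparable across subcorpora when they are computed on samples of the same size.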
While the provision of corpora for multiple domains has increased throughout time, few studies touch upon extending existing specialized, multilingual, multigenre comparable corpora for lexicographic purposes. The benefits of exploiting comparable corpora have been consistently demonstrated in the literature (McEnery & Hardie, 2012; Stefanowitsch, 2020; and many more), while multilingual lexicography has always been neglected due to the challenging nature of creating a unified lexicographic project to encompass rather heterogeneous nuances. Indeed, it is unthinkable to build up-to-date lexicographic resources without a corpus nowadays; yet, building and expanding monitor specialized
multilingual corpora remains a challenge. This study is the first step to address this issue and suggests a path that can be potentially useful to other multilingual resources.
In this ongoing case study, we hypothesize that representativeness and balance for multilingual lexicography purposes should first be pursued in one of the languages of the corpus, chosen as the starting language. Then, a comparison of keyword lists across languages might be a good initial method for assessing whether the corpus is ready to be used as a sound source for equivalents, as suggested by Tagnin and Teixeira (2008). Finally, the process is repeated taking each language in turn as the starting one. This means, as suggested by these authors, that the entry list for each language might be slightly different at the end, reflecting these linguistic and cultural differences.
Medicine is one of the specialized domains that is of particular interest to different communities of speakers, most of whom cannot be considered experts or semi-experts. Their interest in the domain lies in the fact that a certain level of medical knowledge is needed in everyday life, much like a basic understanding of legal concepts. As a prominent characteristic of the domain, terminological variation has been extensively studied (Bowker & Hawkins, 2006; Lončar & Ostroški Anić, 2013; Tercedor Sánchez & López Rodríguez, 2012, to name a few), focusing mostly on the differences in expertise levels among different speakers, i.e., medical experts and laypeople. The more precise, concise, and systematic the discourse is, the greater the term density and the less term variation. As the degree of specialization decreases, specialized discourse becomes more similar to general discourse in terms of conceptual variation, redundancy, ambiguity,
and the extensive use of synonyms and paraphrases to explain concepts (Cabré Castellví, 1998, in Freixa, 2006). The degree of text specialization causes variation in defining concepts, often referred to as contextual variation (San Martín, 2022), conceptual variation (Freixa & Fernández-Silva, 2017) or vagueness in general language (Geeraerts,
1993). San Martín (2022, p. 2) argues that the context determines the exact meaning of the term in that context, i.e., “the term invokes the same concept, but the activated knowledge differs.” Medical concepts related to diseases, conditions, treatments, procedures, etc., are defined and described differently in different contexts and registers, depending on the intended users. The meaning or concept characteristics remain the same, but different characteristics are highlighted depending on the focus of the communicative setting: the cause of an illness, its symptoms or methods of treatment. In other words, if we regard a disease as a semantic frame with its frame elements (FEs), different elements are in focus depending on the context and the user, which also means that the situation is framed differently by using different terms or term variants. Traditional
terminological or analytical definitions, which consist of the superordinate concept and the defined concept’s delimiting characteristics, will therefore often be replaced with types of definitions that exploit other knowledge patterns, e.g., functional or synonymic (Sierra et al., 2008). To compare the structural and conceptual differences in the definitions of medical concepts in texts of different registers and levels of expertise, we compared the definitions of terms for 50 diseases found in two corpora of medical texts in Croatian. The aim of this paper is to establish the most common definitional patterns used to define the concepts referring to different diseases in texts for non-experts. The corpora used for the analysis and extraction of patterns had been previously compiled with Sketch Engine tools within the work done by Ostroški Anić & Brač (2022): a scientific corpus of research papers (5,318,395 tokens) and a corpus of texts taken from medical portals with the general public as their intended audience (5,022,639 tokens). All further analysis and definition extraction were also conducted using Sketch Engine tools.
A list of 50 terms for diseases was first established based on concordances of the Croatian term bolest ‘disease’ in the corpus of texts from medical portals. The concordances of each term were then manually inspected, and five definitions per term were selected for annotation. When concordances exceeded 1,000 hits, a random sample of 300 was taken. The same list of terms was then queried in the scientific medical corpus using the same procedure. The definitions were then annotated following the FrameNet methodology (Ruppenhofer et al., 2016) for the frame elements of the frame Medical_conditions. In addition to annotating FEs, we determined verbal patterns and their lexical markers following the list of markers in Sierra et al. (2010). Table 1 shows two examples of terminological definitions from the two corpora, with the superordinate concept underlined. The definitional patterns established will be used for extracting definitions of other medical concepts to create a dataset of expert and non-expert definitions of medical concepts. The dataset and the typology of definitional patterns will be used for text simplification in order to create popular terminological resources for
non-experts and the general public.
Electronic financial glossaries have proliferated in Spain and Spanish America, as well as in other world regions, with the surge of globally encouraged national financial education strategies over the last couple of decades (OECD/CAF, 2020). Among the resources available on the websites of financial institutions such as central banks and commercial banks, monolingual glossaries emerge with the aim of explaining financial and economic concepts in plain language to non-expert users. They are thus what the Function Theory of Lexicography dubs knowledge-oriented lexicographic tools (Bergenholtz & Tarp, 1995; Fuertes-Olivera, 2010; Fuertes-Olivera & Tarp, 2014).
Recent investigations in this field (Rocha-Ochoa, in press) show that glossaries developed by central banks in Spanish- and Portuguese-speaking countries are enormously heterogeneous in terms of their size, structure and thematic content, and do not always reflect their intended goals. In addition, they limit themselves to a rather basic use of internet technologies and are not kept up to date. Commercial banks have also developed glossaries as part of financial education initiatives, although their goals and motivations might differ from those of central banks (Kalmi & Ruuskanen, 2022), but such lexicographic resources remain unexplored in lexicographic enquiry.
The four Spanish-speaking countries represented in this study – Argentina, Chile, Mexico, and Spain – offer a set of comparable financial education materials, given that the eleven selected glossaries stem both from central banks and from local branches of two commercial banks: Santander and BBVA. Over thirty-six hundred lemmas and definitions were classified, analysed and compared. The comparative analysis revealed tendencies among countries and institutions regarding the arrangement, thematic domain and level of specialisation, as well as in the proportion of acronyms and anglicisms. For example, commercial bank glossaries tend to present a more polished structure and contain more
IT terminology and more anglicisms than those from central banks. Similarly, although the presence of acronyms might be higher in central bank glossaries, such abbreviated units fall within different domains from those found in commercial bank glossaries. The findings not only underscore the variation across banks and countries, but also shed light on diatopic variation within financial terminology in Spanish.
After the Croatian national termbase, Struna, ceased to receive funding in 2019, we began developing a novel model for compiling terminological collections that does not rely on field experts to provide initial terminological information. A potential solution to the problem of finding a practical and dependable source of information in the initial stages of processing terminology (i.e., the ‘raw definitions’) across multiple domains could be GPT-4, the publicly available AI language model developed by OpenAI.
GPT is a large language model that offers a range of capabilities, including answering queries, generating text, and executing tasks like translation and summarization. A custom GPT, TermAI, is currently being devised as an aid module, delivering unprocessed information for terminological units that will be processed in Struna. The initial training phase involved manually providing guidelines for best practices in terminology management, which were designed based on the well-established and successful methodology we used to train field experts in the past. The second phase involves feeding TermAI with modified data exported from Struna. In this paper, we will present the results of a comparative analysis of terminological units generated by TermAI and by field experts in the domain of forensic sciences.
Large Language Models (LLMs) have been dominating the discussion fora on language technology for at least the past seven years. As much as LLMs have spurred progress in NLP, recent research has shown that their performance seems to reach a limit which cannot be overcome with more training data. Therefore, hybrid approaches combining LLMs and Language Resources have been gaining momentum. In this talk I explore possible futures for research in semantic lexical resources in combination with LLMs and AI techniques. As examples of possible research paths, I discuss the application of the FrameNet model to the development of a tool for identifying territories prone to gender-based violence, as well as to the growing field of multimodal NLP.
Among historical and ancient languages, usually under-resourced due to the limited size of corpora and the scarce availability of digital lexical resources, Latin is relatively well documented, thanks to its high relevance to the history of Europe and to the study of Romance languages. As far as lexicography is concerned, several lexical resources are available in digital format, although this is not true for all periods and registers. Indeed, Latin is attested across an extremely long time span, ranging from the 3rd century BC up to the present day – as it still is the official language of the Vatican City State.
Lexical resources for Latin include bilingual dictionaries (Glare, 2012; Lewis & Short, 1879), thesauri (the Thesaurus linguae Latinae – 1900–), and glossaries (Cange et al., 1883–1887), covering different eras from Early (3rd BC – 2nd BC), to Classical (1st BC – 2nd AD), Late (3rd AD – 6th AD) and Medieval Latin. However, for the variety attested at later stages, usually referred to as Neo-Latin, the situation is quite different: much less research has been conducted, and there is no reference dictionary devoted to this period. The Neulateinische Wortliste (NLW, cf. Ramminger, 2016), a work-in-progress lexicon compiled by Johann Ramminger, collects lexical entries for more than 21,000 words attested from the 14th to the 18th century AD in texts written by humanists in a style closely resembling Classical Latin.
This resource was made available only through a web interface (http://nlw.renaessancestudier.org/neulateinische_wortliste.htm), with no access to source data, which were kindly provided to us by the author for the purposes of this work.
Another issue is the lack of interoperability between the NLW and other related resources. This limitation prevents users, for instance, from retrieving the textual occurrences of the lexical items listed in the NLW from the several available Latin corpora. Making linguistic resources interoperable instead means making their (meta)data interact on the Web by using shared communication protocols, data categories and ontologies, thus addressing the so-called FAIR principles of data management (Wilkinson et al., 2016). To meet this need, a current approach to interlinking linguistic resources takes up the so-called Linked Data principles, so that “it is possible to follow links between existing resources to find other, related data and exploit network effects” (Chiarcos et al., 2013, p. viii).
According to the Linked Data paradigm, data in the Semantic Web (Berners-Lee, Hendler & Lassila, 2001) are interlinked through connections that can be semantically queried, to make the structure of web data better serve the needs of users. Data in the Semantic Web are represented according to the Resource Description Framework (RDF) data model (Lassila et al., 1998), where information is structured in terms of “triples” that connect a “subject” to an “object” through a predicate (“property”), and relations between items are expressed by assigning them to “classes” and “subclasses”. Over the last decade, the research community working in the area of Linguistic Linked Open Data (LLOD) has developed several de facto standard ontologies to represent linguistic, and especially lexical, information stored in resources.
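As a minimal illustration of the triple model just described, the following Python sketch stores subject, predicate, object statements in a set and answers a simple pattern query that also follows subclass links. All `ex:` identifiers and the two sample words are invented for illustration and are not taken from the NLW or LiLa; a real deployment would rely on an RDF library and a SPARQL endpoint rather than on this toy store.

```python
# Toy triple store: every statement is a (subject, predicate, object) tuple.
# All "ex:" names are hypothetical examples, not real NLW or LiLa identifiers.
TRIPLES = {
    ("ex:Noun", "rdfs:subClassOf", "ex:Word"),
    ("ex:Adjective", "rdfs:subClassOf", "ex:Word"),
    ("ex:humanista", "rdf:type", "ex:Noun"),
    ("ex:humanista", "ex:writtenRep", "humanista"),
    ("ex:modernus", "rdf:type", "ex:Adjective"),
}


def match(s=None, p=None, o=None, triples=TRIPLES):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def subclasses_of(cls, triples=TRIPLES):
    """Return cls plus every class reachable through rdfs:subClassOf links."""
    found, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for sub, _, _ in match(p="rdfs:subClassOf", o=current, triples=triples):
            if sub not in found:
                found.add(sub)
                todo.append(sub)
    return found


def instances_of(cls, triples=TRIPLES):
    """Return subjects typed as cls or as any of its subclasses."""
    classes = subclasses_of(cls, triples)
    return sorted(
        s for s, _, o in match(p="rdf:type", triples=triples) if o in classes
    )
```

Querying `instances_of("ex:Word")` returns both sample items because each is typed with a subclass of `ex:Word`, which is the kind of class-based inference the paragraph above alludes to.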
In this work, we describe the steps undertaken to publish the NLW as LLOD and to connect it to the LiLa (Linking Latin) Knowledge Base (KB) of interoperable linguistic resources for Latin, which was recently created following the principles of the Linked Data paradigm (https://lila-erc.eu). Linking the NLW to the LiLa KB makes the (meta)data provided by the NLW accessible and retrievable through federated querying across interoperable resources, including those for languages other than Latin, thus easing cross-lingual investigations. To model the information provided by the NLW at different levels (morphological information, sense(s) accompanied by a translation in German, attestations), we rely on Ontolex, a widely used vocabulary that is by now established as a de facto standard for the release of lexical resources as LLOD (https://www.w3.org/2016/05/ontolex/).
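To make the modelling levels mentioned above more concrete, the sketch below serialises a single word-list entry as Turtle using OntoLex class and property names (`ontolex:LexicalEntry`, `ontolex:canonicalForm`, `ontolex:sense`, `ontolex:LexicalSense`). It is only an illustration under stated assumptions: the `ex:` and `lemma:` namespaces, the identifiers, and the choice of `skos:definition` for the German gloss are hypothetical and do not reproduce the actual NLW or LiLa data.

```python
# Sketch: serialise one word-list entry as OntoLex-style Turtle.
# Base URIs, identifiers and the use of skos:definition are assumptions.
PREFIXES = """@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/nlw/> .
@prefix lemma: <http://example.org/lemma/> .
"""


def entry_to_turtle(entry_id, lemma_id, glosses_de):
    """Build Turtle for a lexical entry linked to a lemma, with German glosses."""
    sense_ids = [f"ex:{entry_id}_sense{i}" for i in range(1, len(glosses_de) + 1)]
    # Statements about the entry itself: its canonical form and its senses.
    statements = [f"ontolex:canonicalForm lemma:{lemma_id}"]
    if sense_ids:
        statements.append("ontolex:sense " + ", ".join(sense_ids))
    lines = [
        f"ex:{entry_id} a ontolex:LexicalEntry ;",
        "    " + " ;\n    ".join(statements) + " .",
    ]
    # One LexicalSense per gloss, with the gloss tagged as German.
    for sense_id, gloss in zip(sense_ids, glosses_de):
        escaped = gloss.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f"{sense_id} a ontolex:LexicalSense ;")
        lines.append(f'    skos:definition "{escaped}"@de .')
    return PREFIXES + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(entry_to_turtle("entry_1", "lemma_1", ["Humanist"]))
```

Pointing the entry's canonical form at a shared lemma identifier, rather than storing the written form locally, is what lets several resources meet at the same node.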
After describing the process of linking the NLW to LiLa, we provide examples of the advantages that having this resource published as LLOD yields both for the scientific community of scholars interested in Neo-Latin and for the one interested in lexicographic research. In particular, we present a few federated queries run across the NLW and other lexical and textual resources currently interlinked in the LiLa KB. Finally, we propose a comparison of how lexical entries are described in different lexical resources, such as a word list and a full-fledged dictionary. With the help of a few case studies, we show that the same modelling strategy may account for different lexicographic choices in terms of the hierarchical structure of lexical entries and subentries. If we take the case of substantivized adjectives, we observe that they are treated differently: either as entries or as sub-entries of a superordinate lexical entry. Thanks to the OntoLex module for Lexicography (lexicog:Component and lexicog:describes; see https://www.w3.org/2019/09/lexicog/), we can compare how different resources account for lexical items of this kind.
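The comparison described above can be sketched in a few lines of Python. The `Component` class below loosely mirrors the idea of a lexicographic component that describes a lexical item and may contain sub-components; the two toy resources and the identifiers `bonus_adj` and `bonum_noun` are hypothetical and chosen only to illustrate a substantivized adjective being treated as an entry in one resource and as a sub-entry in another.

```python
# Sketch: the same lexical item as a top-level entry in one resource and as a
# sub-entry in another. Resource contents and identifiers are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Component:
    """A lexicographic component that describes one lexical item."""
    describes: str                                   # id of the lexical item
    sub_components: list = field(default_factory=list)


def placement(roots, target, parent=None):
    """Return 'entry', 'sub-entry of X', or None if target is not described."""
    for comp in roots:
        if comp.describes == target:
            return "entry" if parent is None else f"sub-entry of {parent}"
        found = placement(comp.sub_components, target, comp.describes)
        if found:
            return found
    return None


# A flat word list: adjective and substantivized adjective are both entries.
word_list = [Component("bonus_adj"), Component("bonum_noun")]
# A dictionary that nests the substantivized adjective under the adjective.
dictionary = [Component("bonus_adj", [Component("bonum_noun")])]

if __name__ == "__main__":
    for name, resource in (("word list", word_list), ("dictionary", dictionary)):
        print(name, "->", placement(resource, "bonum_noun"))
```

Because both resources describe the same item identifier, their structural choices can be compared without altering either resource's own hierarchy.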
While there have been a number of projects focusing on early medieval Irish lexicography (Griffith et al., 2018), few have aspired to work towards comprehensive interlinking of textual and lexical resources. This is at least in part due to the morphological complexity and variation in Early Irish (c. 600–1200 CE), compounded by the absence of an orthographic standard (Stifter, 2009). The resulting lemmatic variation in legacy resources (compare Early Irish deponent molaithir/molaidir [ˈmoləðʲərʲ] against active molaid [ˈmoləðʲ] ‘praises’) leads to substantial challenges around effective deployment of currently available lexical resources and justifies a unified collection of canonical forms to interconnect resources. Such a collection serves scholars of Early Irish in the first instance, but also benefits, through potential future interlinking of Early Irish and modern-language resources, scholars of more contemporary stages of the language interested in diachronic change and etymology. This paper focuses on the development of a Lemma Bank for Old Irish (c. 600–900 CE) as part of the MSCA-funded MOLOR project (Morphologically Linked Old Irish Resource), which aims to interlink lexical resources for this language period, including the novel lexical resource Goidelex (Anderson et al., 2024) and an inflected lexicon whose paradigms are generated from lemmas in Goidelex. Methodologically, the current work takes inspiration from the project LiLa: Linking Latin (2018–2023), whose objective was to interconnect distributed (lexical and textual) resources and NLP tools for Latin using the Linked Data paradigm, that is, the use of shared ontologies, data categories, communication protocols and technologies such as the Resource Description Framework (RDF) (Wood et al., 2014) to enable federated, semantic querying over heterogeneous resources, ultimately resulting in what Tim Berners-Lee has called the Semantic Web (Berners-Lee et al., 2001). The adoption of the Linked Data paradigm automatically ensures adherence to the so-called FAIR principles of data management (Wilkinson et al., 2016). As part of the LiLa project, a Lemma Bank has been developed, which was conceived to interlink lexical resources and texts for Latin (Passarotti et al., 2020).
The current contribution will report on the design challenges and choices in populating an Old Irish Lemma Bank with canonical forms from legacy and novel resources, striking a balance between linguistic granularity, on the one hand, and a workable number of lemmas, on the other. It adopts extensions to the OntoLex model (McCrae et al., 2017) made in LiLa to cater for the existence of divergent lemmatisation criteria (i.e., different canonical forms for the same lexeme) on the basis of morphological/inflectional variation. The launch version of the MOLOR Lemma Bank as a Linguistic Linked Open Data resource is expected to contain at least 599 orthographically normalised Old Irish noun lemmas (out of an estimated total of 4000–4500), extracted from Goidelex, whose principled indexing of inflectional (and, hence, lemmatic) variants allowed for straightforward mappings (Fransen et al., 2024). Since the initial focus in Goidelex has been on nouns from one corpus, verb lemmas were instead comprehensively collected from three (less linguistically granular and structured) legacy resources and manually classified according to inflectional class, resulting in approximately 1300 lemmas. More POS categories will be added in due course.
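The core idea of a lemma bank that tolerates different canonical forms for the same lexeme can be sketched as follows. This is a simplified illustration and not the MOLOR or LiLa data model: the class, its methods and the identifier `lemma:1` are hypothetical, only the three verb forms are taken from the abstract, and homographs (one written form shared by several lexemes) are deliberately rejected to keep the example short.

```python
# Sketch: one lemma record groups divergent canonical forms of the same lexeme,
# so resources citing different forms still resolve to the same identifier.
# Class design and identifiers are hypothetical; homographs are not supported.
class LemmaBank:
    def __init__(self):
        self._forms = {}    # written form -> lemma id
        self._lemmas = {}   # lemma id -> ordered list of written forms

    def add(self, lemma_id, *forms):
        """Register a lemma id together with all of its canonical forms."""
        for form in forms:
            owner = self._forms.get(form, lemma_id)
            if owner != lemma_id:
                raise ValueError(f"{form!r} already belongs to {owner}")
        known = self._lemmas.setdefault(lemma_id, [])
        for form in forms:
            self._forms[form] = lemma_id
            if form not in known:
                known.append(form)

    def resolve(self, form):
        """Return the lemma id for a form, or None if it is unknown."""
        return self._forms.get(form)

    def variants(self, form):
        """Return every canonical form recorded for the lexeme of `form`."""
        return list(self._lemmas.get(self.resolve(form), []))


if __name__ == "__main__":
    bank = LemmaBank()
    # Deponent and active citation forms of the same verb ('praises').
    bank.add("lemma:1", "molaithir", "molaidir", "molaid")
    print(bank.resolve("molaid"), bank.variants("molaid"))
```

A resource that lemmatises under the deponent form and one that uses the active form would both resolve to `lemma:1`, which is what makes interlinking across legacy resources possible.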
CHAMUÇA (Cultural HeritAge and Multilingual Understanding through lexiCal Archives) is a pioneering initiative aimed at exploring the impact of the Portuguese language on Asian languages, rooted in the historical exchanges between Portuguese traders, colonists, and diverse Asian cultures. The impact of these interactions extends beyond historical remnants to the modern-day lexicon of Asian languages, which includes a diverse array of Portuguese borrowings, ranging from general vocabulary units to specialised units. This paper details the initiative’s current status, its goals, and the methodology it employs. Additionally, it outlines the essential steps required for organising and structuring the knowledge embedded within and associated with the borrowings. CHAMUÇA, an innovative open-source resource designed to document and study these Portuguese linguistic contributions, will augment the pool of structured lexical data and support cross-linguistic analysis, using state-of-the-art frameworks such as OntoLex-Lemon and TEI Lex-0 to structure the lexical data. Following FAIR principles (ensuring data is findable, accessible, interoperable, and reusable), CHAMUÇA is poised to contribute to the study of linguistic borrowings, cultural interchange, and the preservation of linguistic heritage. Furthermore, the project will encourage community involvement and scholarly collaboration to evolve and enrich its contents, leveraging collective expertise to illuminate the nuances of language contact phenomena.
In the revision process of dictionaries, adding new headwords or new senses to already existing headwords is what typically receives the most attention. In this article, we bring into focus the intriguing dilemma of the exclusion of headwords from the Swedish Academy Glossary (SAOL), which is still published in print versions. In the e-dictionary era, removing headwords may seem questionable. SAOL is, however, a contemporary dictionary which aims to reflect present-day Swedish. In order to keep the lemma list up to date, new headwords are added and obsolete words are removed. The editors of SAOL have practised lemma exclusion in connection with the revisions of new editions for almost 150 years. In this paper, we present SAOL and argue that lemma exclusion is crucial to SAOL’s aim and target group. We also present the most recent corpus material, methods and tools used in this process.
In the introductory part of the presentation, the authors will present the Croatian Web Dictionary – Mrežnik project (Hudeček & Mihaljević, 2020; Hudeček, Mihaljević & Jozić, 2024). Mrežnik consists of three modules – the module for adult native speakers of Croatian, the module for students and the module for non-native speakers learning Croatian. These modules have different approaches to semantic relations. Entries, entry words, and subentries are connected by links according to various criteria (feminine/masculine pairs, word formation, verbal aspect, semantic relations; Hudeček & Mihaljević, 2019).
This presentation will focus on recording these semantic relations in Mrežnik: synonyms, antonyms, hypernyms, hyponyms, co-hyponyms, meronyms, and holonyms (Cruse, 2006; Murphy, 2016). These relations and the different approaches to them in the different modules are illustrated by a number of examples from Mrežnik. Complex synonymic and antonymic relations are shown using the example of two Croatian near-synonyms, voljeti (‘to love’, but also ‘to like’) and ljubiti (‘to love, to like’, but also ‘to kiss’), and their antonym mrziti (‘to hate’). Although native Croatian speakers would mostly say that voljeti and ljubiti are synonyms and mrziti is their antonym, after corpus analysis with word sketches and word sketch differences in SketchEngine, a complex network of synonymic and antonymic relations emerges, as all these words have several closely related meanings and establish the relations of synonymy and antonymy only in some meanings. In some meanings, they are also connected to other verbs. The approach to hypernymy, hyponymy, and co-hyponymy in Mrežnik is illustrated by entries of terms that denote case (Nominative, Genitive, Dative, Accusative, Vocative, Locative, Instrumental). The hypernym is the term case, which occurs in all seven case definitions. The definitions are similar in structure and differ only in the ordinal number (e.g., first case, second case, etc.) and the questions asked for each case. Each case has the six other cases as co-hyponyms. These co-hyponyms are connected by links. The approach to entries denoting case is compared with the approach in the terminological database Struna (Nahod, 2023, pp. 3–10) and the Dictionary of Croatian Linguistic Terminology (Mihaljević & Hudeček, 2024).
Although meronyms and holonyms are regularly mentioned in works on semantics, lexicology, and lexicography, most dictionaries do not record meronymy-holonymy relations. This is not surprising, as “meronymy is generally not as central as other -onymic relations, since it is not a logical relation” (Murphy, 2016, p. 446). At the beginning of the work on Mrežnik, the linking of meronyms and holonyms was considered. This will be illustrated using the example of days of the week (tjedan: ponedjeljak, utorak, srijeda…) and sentence elements (rečenica: subjekt, predikat, objekt…). The reasons for moving away from this idea and using a different approach to meronyms will then be explained. The meronymy-holonymy relation is reflected in the collocation block (introduced by the collocational question Što x ima? ‘What does x have?’), but sometimes also in definitions beginning with the word dio (‘part’) or član (‘member’). However, the authors will show that in some cases it was difficult to show this relation consistently for all parts and sometimes even to distinguish it from the hypernym-hyponym relation.
The investigation of semantic relations in Mrežnik has raised several questions: Are semantic relations important to the users and if so, why? Are there semantic relations between functional word classes (syntactic synonyms)? Is synonymy only possible between members of the same word class? These questions will be answered in the presentation. The advantages of a born-digital dictionary in representing semantic relations as compared to printed dictionaries will be highlighted and the benefits of this approach to semantic relations for the dictionary user and the dictionary compiler will be explained. The linking of synonyms makes it possible to insert a larger number of words. It also encourages the compiler to consider whether the same or a modified definition should be used. The same applies to antonyms. If two words are antonyms, they usually have a similar definition and differ only in a single element. All co-hyponyms have the same hypernym. The advantages for language users are that they can become aware of words that belong to the same semantic field, expand their vocabulary, become aware of the lexical system, and systematize their linguistic knowledge. Some results and solutions used in Mrežnik were presented to the target groups of users (especially non-native speakers learning Croatian, Hudeček, Mihaljević & Pasini, 2024) and their suggestions were taken into account. Further research with non-native speakers of Croatian is planned. The semantic network of relations and its lexicographic treatment in Mrežnik is compared with the semantic relations in Croatian Linguistic Terminology – Jena (Mihaljević, Hudeček & Jozić, 2023), a terminological database created simultaneously in the same institution and partly by the same authors. The research of semantic relations in Mrežnik, already applied to the Croatian Linguistic Terminology – Jena project, could be a model for future dictionaries of (South) Slavic languages.
This article presents semantic information about contemporary standard Slovenian on the Franček educational language portal, which is aimed at primary and secondary-school students. The portal’s primary role is to enhance students’ dictionary skills as part of the national language education program and to introduce users to other linguistic resources, such as school grammars. The portal offers a user-friendly environment tailored to meet the needs of its target users. Emphasis is placed on its innovative design and user-centric approach, which facilitate intuitive learning and engagement with lexicographic content. The lexicographic content of the Franček portal is organized in eight content modules altogether, and the semantic information discussed here is presented in meaning, synonymy, and phraseology modules. The article discusses the decision-making process, technical issues, and visualization strategies for adapting semantic information to three distinct age groups.
This research proposes a step forward in the automatic identification and analysis of verbal idioms in Croatian. The use of the NooJ automated text processing tool, along with the MaCoCu corpus and the Online Dictionary of Croatian Idioms (ODCI), provides a robust framework for recognizing and categorizing these multi-word expressions (MWEs). The research comprises two parts: (a) creation of a dataset by utilizing the ODCI that allowed for a set of 898 verbal idioms to be compiled and annotated with linguistic features, including structure, morphological features, and variation patterns; (b) analysis of extracted data that provides insights into the lexicographical and linguistic significance of the idioms, such as variability, modification, and frequency of use. The study highlights the challenges posed by idiomatic variations and the verb’s role as the most variable component in idioms. For instance, the idiom “soliti pamet komu” (to give unsolicited advice) is often modified for expressiveness, such as in the phrase “having a big saltshaker to salt everyone’s mind.” The dataset aims for lexicographic integration into ODCI and supports the creation of electronic language resources. It also contributes to theoretical and cross-lingual research, with the CLARIN repository expected to enhance data reusability in NLP. The study’s findings offer a deeper understanding of verbal idioms’ dynamics and their computational processing.
The identification of metaphors provides a strong foundation for further metaphor research. Since Lakoff and Johnson (1980) introduced the Conceptual Metaphor Theory, most research has focused on metaphor’s conceptual and cognitive aspects, often neglecting the linguistic dimension. Much of this research has relied on researchers’ intuition rather than systematic analysis. However, the Metaphor Identification Procedure (MIP; Pragglejaz Group, 2007) and its elaborated version MIPVU (Steen et al., 2010) introduced a systematic and reliable methodology for identifying linguistic metaphors. Both procedures compare basic and contextual meanings to determine metaphorically used words. Whether contextual and basic meanings are distinct enough is measured by their degree of independence as separate meaning descriptions in dictionaries, which have thus become integral to metaphor identification. Any meaning not found in these dictionaries is considered novel. A slightly modified version of the MIPVU approach, incorporating modifications similar to those outlined in Bogetić, Broćić, and Rasulić (2019), was employed to manually annotate a small corpus in the Slovene language (Antloga, 2020). We referred to the Dictionary of the Standard Slovene Language (SSKJ) when determining the meaning of a lexical unit. During the process, we started to pay more attention not only to sense disambiguation but also to how identified metaphorical expressions are represented in the dictionary and the way dictionary entries reflect identified conceptual structuring. We expanded our analysis through manual inspection of entries, definitions, and observations in other lexicographic resources, such as the Thesaurus of Modern Slovene, The Collocation Dictionary of Modern Slovene, and The Dictionary of Slovenian Transitive Verbs.
We only considered indirect metaphorical expressions since direct language use cannot be captured by contrasting basic and contextual meanings and should, therefore, be approached without the use of dictionaries. We employed a bottom-up approach to first identify conceptual mappings from annotated indirectly used metaphor expressions (like absorb the wisdom of others). We later observed the metaphor scenario, namely what kind of information about the conceptual structure (absorb what) of the metaphor (BODY IS A CONTAINER and KNOWLEDGE IS A SUBSTANCE) we can extract from the dictionary entry of the literal and metaphorical meaning(s), for example, from the literal meaning (to absorb ‘to take in by sipping’) omitting the manner in which the absorption is done when defining metaphorical meaning (‘to take in’), indicating a level of abstraction by eliminating the human activity. The Collocation Dictionary of Modern Slovene supports the existence of a metaphor QUALITIES ARE SUBSTANCES (and therefore KNOWLEDGE IS A SUBSTANCE) by listing collocations to absorb knowledge, to absorb beauty, to absorb positive feelings. During the analysis of individual lexical units, we made some observations regarding the treatment of metaphorical expressions in the dictionary (Table 1).
The semantics of body part nouns is particularly fascinating from the point of view of the evolution of word meanings, their metaphorical and metonymic derivation and, from a cross-linguistic standpoint, for the large amount of overlap between different languages. The status of BODY as a semantic prime, especially in the Natural Semantic Metalanguage (NSM) (cf. Wierzbicka, 2014; 2007), has not been challenged by recent studies in lexical semantics. In combination with further NSM primitives such as the relational substantive PART and space substantives such as BELOW and SIDE, BODY is the fundamental building block of complex semantic units such as ARM, HEAD, NECK, EYE etc. Wierzbicka (2007, p. 15) argues that “the domain of the human body is an ideal focus for semantic typology and, […], cognitive anthropology, because the body is, almost certainly, a conceptual […] universal. […] despite a good deal of variation in the lexical details, everywhere in the world people appear to think about the human body in terms of certain parts […].” The “good deal of variation” makes this topic an interesting testing ground for lexicography. Intralinguistic analysis, especially when supported by historical-typological evidence, highlights a productive area of […] which appears difficult to pinpoint. The cognitivist perspective helps us to better grasp the mechanisms of conceptualization and reasoning behind this phenomenon. As pointed out by Manerko (2014) in a study of idiomatic expressions in English, cognitive modelling of the human body involves the classification of its parts along a restricted number of topological classes which may integrate different image schemas, e.g., the ‘container’ one (cf. Lakoff & Johnson, 1980). Cross-linguistic analysis reveals common paths of […] that sometimes extend beyond the concrete, basic meaning of a word.
It also provides crucial insights into the phraseology of body part nouns in view of their lexicographic treatment. The phraseological expressions at the core of the study are word combinations with a relatively high degree of fixedness and partly non-compositional meaning. The nouns we would like to discuss in the three languages Italian, English and German are mano/hand/Hand and braccio/arm/Arm because of the wide overlap of concrete and abstract meanings. Partial equivalences between further nouns will also be mentioned. Our case study starts from constructionist and phraseological-constructionist lexicological studies carried out on different languages in the quest for shared syntactic-semantic features in the creation of phraseological expressions centred on body part nouns (cf. Ganfi et al., 2023). A lexicological analysis focusing on the semantic relations that involve various body part nouns in their different phraseological manifestations, e.g., SPATIAL ORIENTATION and POSSESSION in the case of hand, provides a baseline for an in-depth understanding of the […] processes that may apply to different phraseologisms, even across languages, and offers a methodological foundation for their lexicographic treatment. The application of specific semantic relations as well as of generic image schemas to the treatment of phraseological expressions, in particular of idioms, in a lexicographic resource is treated with reference to the dictionary model named Phrase-based Active Dictionary (PAD) (cf. DiMuccio-Failla & Giacomini, 2022), which is part of PhraseBase, an integrated lexical information system for language learners made up of a lexicographic, an ontological and a grammatical component.
After presenting the microstructure of a PAD, we will show which positions within a lexicographic entry are dedicated to the description of idioms, especially when a non-compositional phraseological meaning coexists with a literal (compositional) meaning (e.g., to get your hands on something). We will also illustrate which semantic-pragmatic indications are provided to the ideal user (an advanced learner) regarding these expressions. The case study highlights, from a cross-linguistic angle, the influence of cognitive processes in determining the complex semantic properties of phraseology related to body part nouns. In general, we argue that this type of knowledge, which is assumed to play a significant role in almost any semantic field, should be explicitly made available to dictionary users (cf. Ostermann, 2015), especially non-native speakers, and that it should be conveyed by means of verbal and non-verbal contents.
What strategies are currently being applied in electronic dictionaries and terminology databases to gender representation, with a particular focus on feminine agentives? Starting with an overview of the state of the art in gender studies in lexicography and terminology, in this paper we reflect upon the approaches taken in electronic dictionaries and terminology databases, collecting both good practices and concrete limitations that hinder equal gender representation by contrasting different linguistic and socio-cultural contexts. Since lexical resources can be considered not only utility instruments but also “agents” that reflect society, there is a risk that they contribute, for example, to perpetuating gender stereotypes. Thus, it is important for lexicographers and terminologists to discuss and establish methods for avoiding gender bias in lexical resources.
The paper deals with vocabulary related to age and age groups, its semantic and pragmatic characteristics, and challenges it can pose for lexicographic description. The inventory and treatment of lexical items from the domain of age in contemporary Croatian general dictionaries and in the most comprehensive contemporary Danish dictionary are analysed regarding the semantic and pragmatic information provided, as well as the presence of age stereotypes or strategies to mitigate them. In Croatian dictionaries, many terms denoting older people or phenomena relevant to them are not listed. Semantic and pragmatic information, such as connotations or appropriateness, is incomplete and inconsistent, and examples reflect common stereotypes of young and old people. In the Danish dictionary, the inventory of lexical items is more comprehensive, the descriptions are more detailed, and pragmatic information is commonly provided. The examples are mostly realistic and neutral. However, a comparison with entries related to some other social groups indicates a slightly higher tolerance for age-related stereotypes.
At the end of 2019, Merriam-Webster announced that the singular gender-neutral personal pronoun “they” was the most frequently looked-up word of the year. In fact, the non-binary sense of this pronoun and its nonstandard reflexive form “themself” had been added to the dictionary only a few months before, in September 2019. The aim of the paper is to analyse the lexicographic treatment of the singular pronoun “they” as well as its objective, possessive and reflexive forms a few years after the non-binary sense was entered in dictionaries. The paper will also briefly look at gender-neutral and non-binary pronouns recently added to the lexicons of other languages (see e.g., Renström, Lindqvist, & Gustafsson Sendén, 2021) and their lexicographic representation: “hen” in Swedish and Norwegian, “hen” (a borrowing from Swedish) and “xier” in German, “hen” and “die” in Dutch, “iel” in French, and “elle” in Spanish. To account for gender fluidity, the present study has been conducted in the theoretical framework of Cognitive Linguistics, where conceptual categories are viewed as non-discrete, and social concepts are not static, but gradable and dynamic (Geeraerts, 2016). The entries for they, them, their and themself/themselves are examined in monolingual English online dictionaries: four monolingual learners’ dictionaries (Cambridge Advanced Learner’s Dictionary, COBUILD Advanced English Dictionary, Longman Dictionary of Contemporary English, and Oxford Advanced Learner’s Dictionary) and three monolingual dictionaries for native speakers (one comprehensive dictionary, the Oxford English Dictionary, and two standard ones, Merriam-Webster and Collins English Dictionary). In order to gain a wider perspective, the study also uses Google’s English Dictionary, the crowd-sourced Wiktionary and the latest innovation in lexicography, ChatGPT, as points of comparison for the information collected from the dictionaries.
Two main components of the dictionary entry are subjected to scrutiny and evaluated: definitions and examples of usage, and in monolingual learners’ dictionaries, also usage notes, as these are of paramount importance when it comes to raising awareness of this significant socio-linguistic issue in a language that is not the mother tongue. Another criterion of analysis is the number of senses distinguished, their ordering and arrangement. It is checked whether the non-binary sense is lumped together with the gender-neutral one, or presented as a separate sense or subsense. It is also investigated whether the non-binary sense is distinguished in the entries for all the inflected forms of the pronoun, or just the nominative form. Definitions of the gender-neutral sense of “they” are also analysed to see whether they comprise only binary forms such as “he or she”, “him or her”, “his or her”, or “male or female” instead of opting for a more inclusive form, e.g., “a person”, or adding the third gender to the spectrum of possibilities. Additionally, the dictionary definitions of the words “gender” and “sex” are looked into, as these lexemes are often used in the definitions of the gender-neutral and non-binary senses of “they” and its forms. Moon (2014: 98) notes that all the big four online learners’ dictionaries provide purely male-female binary explanations of gender, and she recognises the need for changing sociocultural paradigms to be acknowledged and reflected in lexicographic definitions. As far as example sentences are concerned, it is investigated whether examples of usage of the non-binary sense are presented, whether they contain the antecedent in the singular form to illustrate the singular reference of the pronoun, and whether the form of the verb used after the anaphora clearly indicates that the verb must be plural, avoiding forms which might be ambiguous, such as past tense forms of verbs other than to be.
The results indicate that the examined dictionaries are not equally inclusive regarding the treatment of the non-binary and gender-neutral senses of “they”, and some still follow the traditional distinction in defining “gender” or “sex”. In fact, ChatGPT turns out to provide the most comprehensive explanation regarding gender identity. As LGBTQIA+ people define themselves through language, it is the role of the dictionary to provide an adequate and inclusive explanation of non-binary forms.
Historical language data can give us an insight into the conceptual and everyday world of past times. However, this insight very often relates only to a small group of society with strong political and social influence. What the linguistic and social situation looked like for the majority of the population can usually only be guessed through the interpretation of others, as only a small proportion took part in the written production that is available to posterity. It was only in the 19th century that dialectology, which was still young at the time, showed an increasing interest in the language used by rural society. For a representation of the language, the less mobile, less educated and manually working peasant class (often referred to as NORMs or NORFs) was usually chosen, as they were assumed to produce the “most original” and “purest” dialect. Attempts were also made to explain or understand historical language change processes based on recent dialectal conditions. In addition to the language atlas projects, which adhered to this methodological paradigm until the second half of the 20th century, the dialect dictionaries played (and still play) a central role in the documentation and preservation of dialect forms. A prominent example, which also provides the basis for our presentation, is the “Dictionary of Bavarian Dialects in Austria” (‘Wörterbuch der bairischen Mundarten in Österreich’, WBÖ). The project looks back on a long history, starting in 1912. In line with the academic study of dialect and the close interlinking of dialectology and ethnology, the aim was to gain an “insight into the entire imaginative and emotional world of dialect” (‘Blick in die gesamte Vorstellungs- und Gefühlswelt der Mundart’) (Seemüller & Much, 1911). The main interest was therefore in the documentation of the everyday vocabulary of the rural population in the first half of the 20th century.
To this end, a comprehensive collection of material was started in 1913, most of which was collected over the next few decades with the help of so-called volunteer “collectors” (‘Sammler’). The survey was based on “questionnaires aimed at collecting dialect vocabulary and gaining knowledge of the associated factual and folkloristic material” (‘Fragebogen, die in gleicher Weise darauf hinzielen, den Mundartwortschatz zu sammeln sowie Kenntnis von dem dazugehörigen sachlichen und volkskundlichen Gut zu gewinnen’) (preface of WBÖ Vol. 1: XVI). Since the foundation of the WBÖ and the data collection, a lot of time has passed, and society has undergone major changes. In Austria, for example, only 3.5% of the population works in the agricultural sector today, compared to almost half at the beginning of the 20th century (cf. Zeitlhofer, 2011, p. 46; Statistics Austria). If we look at the linguistic material of the WBÖ from today’s perspective, there are some considerable differences from the way the language is used today. Many terms from the historical rural context seem unfamiliar or are no longer known, or their use seems inappropriate or discriminatory. In our presentation, we will systematically examine the material of the WBÖ for negative characterizations and designations of persons. For the most part, these pejorative terms are created by expanding the basic meanings of lexemes and transferring certain prototypical characteristics to persons. These semantic transfers are often metaphorical (cf. examples 1 and 2), but in some cases they can also be extended metonymically (cf. example 3).
1. Geige (fiddle) ‘mocking term for a long, lean girl’
2. Gocke (corn) ‘misshaped person’ or ‘stupid person’
3. Fetzach (rag) ‘worthless woman’s dress (derogatory)’ -> ‘careless woman’
Some of the material generated goes back directly to or emerges from the survey.
Other content was expressed immediately or arose through associations, which is partly due to the survey’s lack of systematicity. Contents that the rural society was aware of were deliberately asked about, while the authors of the dictionary came from the middle-class milieu, and the incredibly detailed survey encouraged both collectors and informants to discuss other topics that were still unknown to science. We examine which areas the respective designates primarily come from, which triggers can be found and which processes take place during a semantic transformation. In addition, we will address the question of which groups of people the negative designations refer to and whether there are correlations with the semantic processes behind them.
This paper accounts for a system of semantic fields that was developed in Iceland around the turn of the century. The purpose of the system was to help describe the semantic properties of the Icelandic vocabulary and to be a practical tool in lexicographic work. The system categorizes words into semantic fields, enabling nuanced organization and practical applications in monolingual and bilingual dictionaries. The article details the system’s origins, structure, and implementation, including its role in producing specialized glossaries and enhancing dictionary editing efficiency. While the system has proven valuable, some inconsistencies in classification and level of detail are noted, suggesting areas for improvement. Lastly, closing observations are presented.
The paper presents a count-based semantic vector space model for Ukrainian, which has been applied to the semantic change detection task. The approach involves creating multidimensional vector representations of occurrences of a particular lexeme or a group of related lexemes, followed by visual and quantitative analysis of the resulting semantic vector space. The multidimensional space has been reduced to 2D for visual data analysis with the Multidimensional Scaling technique. The paper describes two case studies to show how the proposed R&D workflow helps reveal potential semantic change events, and discusses the benefits and limitations of the approach. One case study traces the disappearance of a regional sense, and another identifies the appearance of a new metaphoric sense that is widespread in Ukrainian media discourse.
Within the field of historical lexicography, the study of old occupational terms and societal roles offers a unique lens through which we can improve our understanding of the evolution of language and its reflection on dynamics relating to a given cultural context. This paper offers a semantic analysis of the terms used to describe various occupations, focusing on sorcerers and magicians in the Estonian lexicographical tradition from the 17th to the 18th century. This period marks Estonia’s earliest stage of dictionary culture. The significance of examining words within this category is their ability to provide an understanding of societal roles, beliefs, and semantic changes over time. Although numerous job titles have remained primarily unaltered, we can also observe interesting modifications in morphology (such as the use of deverbal causatives) and semantics. This can be illustrated with the case of õppija (derived from the verb õppima), which used to mean both ‘student’ and ‘(Lutheran) minister; teacher’ for almost a century, before the causative õpetama ‘to teach’ gave the noun õpetaja ‘(Lutheran) minister; teacher’ and õppija came to be used only in the meaning ‘student’ (Jürviste, 2023). Occupational titles not only indicate the professions and social status of individuals but also reflect the societal values and hierarchies of the time. Similarly, the terminology used to describe sorcerers and magicians can reveal much about the cultural and social attitudes towards esoteric and traditional knowledge. We intend to examine these terms’ nuanced meanings, origins, and historical development through lexical semantics, illustrating the findings with specific examples. Our methodology involves pprox. the vocabulary in six dictionaries published in the 17th and 18th centuries containing the Estonian language (Stahl, 1637; Gutslaff, 1648; Göseken, 1660; Vestring, 1998 [1710–1730]; Helle, 1732; Hupel, 1780; Hupel, 1818). 
This analysis is complemented by a review of later terms and their usage in contemporary texts. Significant lexical findings were made by later folklorists who studied dialects, including the terms of sorcery. The study covers approximately one hundred terms, although a narrower focus is given to a smaller set of examples. Categorising these terms into semantic fields, tracing their etymological roots, and examining their contextual usage can help identify patterns of semantic shifts, societal attitudes, and the influence of external factors such as religious, cultural, and linguistic contacts. More specifically, we will investigate the possibility of pprox. those terms with the help of componential analysis and prototype theory (as described in Geeraerts, 2010). Can we draw semantic borders within terms that may not – perhaps – allow such pprox.ng? It is worth noting that the study of terms related to sorcerers and magicians holds significance beyond mere academic curiosity. The paper offers valuable insight into the intricate relationship between language, culture, and society, particularly regarding how individuals interact with the mystical and unknown. The disparities between the 17th and 18th centuries were noteworthy in this field. Furthermore, this research contributes to the field of historical lexicography by showcasing the dynamic nature of language and its capacity to adjust to evolving societal norms and values. By comprehending the semantic evolution of these terms, a deeper understanding of the cultural and social fabric of Estonian society during the 17th and 18th centuries can be gained. This research enriches knowledge about Estonian lexicography and contributes to broader discussions on the relationship between language, society, and culture.
This article addresses a crucial topic in lexicography and metalexicography, namely the challenge of defining meanings in historical dictionaries. Its aim is to present an overview of the criteria for crafting effective definitions proposed by semanticists, logicians and lexicographers, primarily tailored to meet the exigencies of contemporary dictionaries, to attempt to apply these principles to an academic historical dictionary of the Polish language, and to assess the feasibility and justification of their implementation. The analysis encompasses: a) types of dictionary definitions; b) fundamental tenets of effective definitions, such as adequacy, substitutability, translatability and analyticity; c) common pitfalls regarding the formulation of definitions (inadequate definitions: overly broad or narrow definitions, direct or indirect circular definitions, ignotum per ignotius); and d) the lexicographer’s perspective, including the linguistic versus encyclopaedic nature of definitions, the quandary of categorisation (taxonomy) and valuation. The basis for the analysis is the Electronic Dictionary of the 17th- and 18th-century Polish. The theory of definitions and its practical implementation in this dictionary are discussed. References are also made to other Polish historical-philological lexicons to elucidate comparable challenges and facilitate generalisation.
This article discusses the project Dictionary of the Dubrovnik Idiom, conducted at the Institute for the Croatian Language. The project aims to develop a born-digital diachronic dictionary of the Dubrovnik idiom, covering the period from the 16th century to the end of the 20th century. The dictionary will be based on a historical corpus compiled within the project’s scope, featuring texts from the same period. Upon completion, this dictionary will be publicly available in digital form, providing valuable insights into the linguistic evolution of the Dubrovnik region. The creation of such a dictionary will meet the needs of the scientific and cultural public as well as the citizens of Dubrovnik. Specifically, the Dubrovnik idiom is relevant because it played an important role in the standardization of the Croatian language. Additionally, well-known Renaissance and Baroque literary works are composed in it, and today it is in decline, which concerns its speakers.
Cross-lingual embedding models act as facilitators of lexical knowledge transfer and offer many advantages, notably their applicability to low-resource and nonstandard language pairs, making them a valuable tool for retrieving translation equivalents in lexicography. Despite their potential, these models have primarily been developed with a focus on Natural Language Processing (NLP), leading to significant issues, including flawed training and evaluation data, as well as inadequate evaluation metrics and procedures. In this paper, we introduce cross-lingual embedding models for lexicography, addressing the challenges and limitations inherent in the current NLP-focused research. We demonstrate the problematic aspects across three baseline cross-lingual embedding models and three language pairs and outline possible solutions. We show the importance of high-quality data, arguing that it plays a more vital role than algorithmic optimisation in enhancing the effectiveness of these models.
The LBC-Platform (https://www.lessicobeniculturali.net) is a comprehensive lexical information system that aims to integrate various types of corpora and resources: dictionaries, concordances, monolingual Language for Special Purposes (LSP) corpora in different languages and LSP parallel corpora. Designed for users interested in cultural heritage, the platform provides free access to resources and tools such as NoSketch Engine, facilitating an in-depth understanding of the language of Fine Arts (encompassing painting, architecture, and sculpture) and highlighting its multidisciplinary nature and diverse discourse types. Numerous studies have demonstrated the platform’s utility for various purposes, including lexicography, translation, and didactics. This paper focuses on the preparation of the multilingual Vasari parallel corpus, one of the LBC Project’s four planned subcorpora (Vasari corpus, Literature corpus, Museum Web-Site Corpus, Travel Guide Corpus). Special attention is given to the Italian-English, Italian-French, and Italian-German components of the corpus, based on G. Vasari’s Le vite de’ più eccellenti architetti, pittori, et scultori italiani (1568) and its translations. In addition to methodological choices, including alignment strategies and potential technical solutions, a small case study on the words disegno (‘drawing’) and disegnare (‘to draw’) will serve to illustrate the current status of the project and its relevance for LBC dictionaries, as well as future perspectives.
We present the COR.SEM lexicon, an open-source semantic lexicon for general AI purposes funded by the Danish Agency for Digitisation as part of an AI initiative embarked upon by the Danish Government in 2020. COR.SEM describes the core senses of 34,000 Danish lemmas with formal semantic information, e.g., ontological type, hypernym, semantic frame, regular polysemy pattern, and polarity value; features which are in essence drawn and simplified from other existing resources. Lexical information from The Danish Dictionary DDO and the Danish Thesaurus DDB is also integrated, e.g., user examples, domain label, synonyms, and near synonyms. It provides direct links to synsets in the Danish WordNet DanNet, as well as to the morphological lemma information in COR, the Central Word Register, which is based on the Danish Orthographical Dictionary and DDO. The register’s common numerical index at both lemma and sense level makes it more straightforward to merge mono- as well as bilingual dictionaries with COR.SEM and thereby inherit the formal semantic information. At the website corsem.dsl.dk it is possible to browse the lexical entries and to download tailored extracts of data of your choice. We give examples of the use of COR.SEM in linguistic studies, in NLP tasks and in lexicographic projects.
This study presents an innovative approach to crafting and enhancing Japanese lexical networks by incorporating large language models (LLMs), especially GPT-4o, utilizing data from the Vocabulary Database for Reading Japanese to accommodate various proficiency levels. Through this process, we extracted a total of 137,870 synonym relations and 54,324 antonym relations, forming a network comprising 104,427 nodes. A portion of the dataset underwent manual evaluation to determine the accuracy of the extracted synonym relationships, yielding an average evaluation score of 4.08 out of 5. Our findings demonstrate that employing graph-based methods enhances transparency and interpretability, allowing for the visualization of intricate semantic structures and enabling continuous updates. The study emphasizes the synergy between AI-driven data generation and traditional lexicographic expertise, offering a scalable and adaptable framework for diverse linguistic applications, with implications for computational linguistics and NLP technologies.
Dictionaries have traditionally served as more than mere repositories of words; they have aimed to sketch some of the relationships between words, including semantic, collocational, or hierarchical connections. However, the physical constraints of print media often limited their scope, restricting the depiction of these relationships to cross-references, exemplifications, and, in specialized instances, etymological connections or language comparisons (in the case of bilingual dictionaries). This left users to piece together the broader semantic network on their own. The potential for lexicographical work has seen remarkable growth with the advent of digital platforms. These platforms transform our view of the lexicon from a collection of words to a dynamic network, emphasizing the significance of various word relationships. An example of such a platform is the Latent Dictionary project (https://latentdictionary.com), which, despite its nascent development stage, showcases the potential of digital lexicography for mapping semantic space and lexical relationships by utilizing contextual word embeddings. This method opens up new possibilities for deeply exploring the lexicon and understanding how words relate to each other. Exploring the networks of words has led to incorporating the contextual information of lexemes (collocations, embeddings). A growing body of research also highlights how a word’s grammatical behavior can uncover unique characteristics, contributing to a more precise mapping of these networks. Central to our study is the introduction of grammatical profiles into lexicographical research. A grammatical profile, as discussed by Janda & Lyashevskaya (2011), is the relative frequency distribution of the inflected forms of a lexeme. In our study, we examine the frequency distribution across 14 Czech noun forms (7 cases in the singular and 7 in the plural) to understand their relative frequency.
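As a minimal sketch of the definition above, a grammatical profile can be computed by counting how often a lemma occurs in each case-number slot and dividing by the total. The tag labels, the function name and the toy observations below are assumptions made for illustration only; they are not the study's tag set or data.

```python
from collections import Counter

# Hypothetical tag set: 7 cases x 2 numbers = 14 slots.
CASES = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]
NUMBERS = ["sg", "pl"]
SLOTS = [f"{case}.{number}" for number in NUMBERS for case in CASES]


def grammatical_profile(tags):
    """Return the relative frequency of each case-number slot.

    `tags` is an iterable of slot labels, one per corpus occurrence
    of a single noun lemma. Labels outside SLOTS are ignored.
    """
    counts = Counter(tags)
    total = sum(counts[slot] for slot in SLOTS)
    if total == 0:
        # No usable occurrences: return an all-zero profile.
        return {slot: 0.0 for slot in SLOTS}
    return {slot: counts[slot] / total for slot in SLOTS}


# Invented toy observations for one lemma (not corpus data).
toy_tags = ["nom.sg"] * 5 + ["gen.sg"] * 3 + ["acc.pl"] * 2
profile = grammatical_profile(toy_tags)
```

Each lemma then corresponds to a 14-dimensional vector whose components sum to 1, so lemmas with very different overall frequencies can be compared directly.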
In compiling dictionaries, considering word formation is crucial in several phases: a) selecting criteria for including words in the dictionary, b) uniformity in dictionary definitions of headwords that have some word-formation relation, and c) consistency in qualifying certain senses as stylistically marked. Regarding the criteria for including words in the dictionary, this article pays special attention to statistical analysis of the correlation between multistage derivatives and the frequency of individual words in word-formation chains, in corpus sources, and in the case at hand the Gigafida 2.0 and metaFida corpora. It is assumed that the frequency of a word decreases as its formation stage becomes higher. This article presents dictionary solutions as indicated by dictionary concepts for the monolingual dictionary of standard Slovenian.
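One way to check the assumption that frequency falls as the formation stage rises is a rank correlation between the two. The sketch below uses Spearman's coefficient, which is an assumption on our part, since the abstract does not name the statistic used; the stage numbers and frequencies are invented toy values, not Gigafida 2.0 or metaFida counts.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Invented toy word-formation chain: stage 0 is the base word.
stages = [0, 1, 2, 3]
frequencies = [12000, 3400, 150, 7]
rho = spearman(stages, frequencies)
```

A value of rho close to -1 would be consistent with the assumption; values near 0 or above would count against it for the chain in question.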
In this contribution, we propose and discuss lexicographic devices for the presentation of aspectual properties of verbs in the entries of a monolingual electronic dictionary for advanced learners. The umbrella term ‘verbal aspect’, as we understand it, denotes the interplay of different linguistic devices that contribute to expressing the temporal structuring of events and situations in a language (Comrie, 1976, p. 3), i. e. “how events unfold over time” (Croft, 2012, p. 4). Verbal aspect can thus be seen as a conceptual category that is realised differently in each language (Dessì Schmid, 2014). We assume that the verb is the pivotal element in the expression of verbal aspect and that each verb or verb meaning can be associated with different ‘aspectual properties’ on a semantic as well as a syntactic level (Johanson, 2000, p. 66; Coll-Florit, 2009). For example, the English verb run can, according to the Oxford Advanced Learner’s Dictionary, denote a physical activity as in He ran home in tears to his mother (OALD, s. v. run) or the location of an entity as in The road runs parallel to the river (OALD, s. v. run). From an aspectual perspective, the first meaning is an accomplishment (Vendler, 1957, pp. 145f.), which takes an animate being as the subject as well as a location adverbial or prepositional phrase and which can be used in the progressive form. The second meaning is a state (Vendler, 1957, pp. 146f.) that is associated with an elongated entity in a fixed position and usually does not allow the progressive form. This basic example shows that different aspectual properties of verb meanings can be related to distinctive semantic and combinatorial restrictions. While there is a vast amount of linguistic literature on verbal aspect (for an overview: Sasse, 2002; Filip, 2012; Dessì Schmid, 2014), lexicographic research (and existing dictionaries) have not yet fully recognized the importance of the phenomenon. 
We are, however, convinced that by explicitly considering verbal aspect, we can render the description of verb meaning and combinatorial properties of verbs more accurate and exhaustive and we can provide learners with additional information that is important for the language production of advanced foreign speakers. The advantages of considering verbal aspect in lexicography will be discussed in relation with a specific dictionary, the Phrase-Based Active Dictionary (PAD), which is currently being developed for German, English and Italian (DiMuccio-Failla & Giacomini, 2022). The dictionary is phraseology-centred in that it subscribes to Sinclair’s claim that a word’s meaning can be identified and described by the occurrence of the word in a distinctive pattern which is determined by: (1) a specific grammatical structure (‘colligation’), (2) associated words (‘collocation’) and (3) associated groups of words with shared semantic features (‘semantic preference’) (i. a. Sinclair, 2003). These patterns are the smallest units of the dictionary entries and can be derived from corpus data (Giacomini et al., 2020; DiMuccio-Failla & Giacomini, 2022). Starting from the usage patterns, the dictionary’s microstructure is constructed to make cognitive relations between the senses and sub-senses of the word in question accessible to its users (DiMuccio-Failla & Giacomini, 2017; 2022, pp. 481–485). Verbal aspect can be included in different parts of the PAD’s dictionary entries. For instance, it is an integral part of each pattern of a verb since each verb meaning necessarily expresses some kind of temporal structure. Beyond that, it can – to different degrees – contribute to meaning variation of a verb or influence its combinatorial preferences. We will show in detail how verbal aspect information can be presented to users, by discussing selected entries for verbs of movement from German, English and Italian in the PAD. 
We will focus on the way in which different parts of the entry (definitions, usage examples, but also specific items, graphs or usage notes) may convey aspectual information. The presentation of aspect information is based on two desiderata: (1) it should be based on a well-defined set of terms, since the area of verbal aspect is characterised by conflicting views on terminology and concepts (Sasse, 2002, p. 199; Dessì Schmid, 2014, p. 47ff.). (2) The terms should be (made) accessible to users without expert knowledge in the area, since widely used terms such as ‘telic’, ‘imperfective’ or ‘unbounded’ are not readily accessible.
NomVallex is a manually annotated valency lexicon of Czech nouns and adjectives that enables research into various language phenomena related to valency, including the comparison of valency properties of affirmative and negative forms of words. This paper presents new developments in the way the lexicon facilitates research into word-level negation, explaining the reasoning behind the proposed lexicographic treatment. Differentiating between direct negation and lexicalized negation, we discuss whether or not the negative forms of words should be listed separately in a valency lexicon. We argue that, while lexicalized negation has to be assigned an individual entry, direct negation of nouns and adjectives should be treated within an entry for affirmative forms. Considering various aspects of word-level negation, including the employment of negative forms of nouns and adjectives in light verb constructions or phrasemes, we describe negation-related attributes applied to the data of the lexicon. As a case study, the facilities provided by the lexicon are used to illustrate use and distribution of negative forms of Czech deadjectival nouns, and their valency properties are compared to those of the corresponding affirmative forms.
John Pickering’s Vocabulary has scarcely been analyzed in detail, in spite of the fact that it is the first dictionary of Americanisms. (In this respect, the situation seems to be greatly different from the case of Robert Cawdrey’s Table Alphabeticall (1604), the first English monolingual dictionary in England.) Against this background, it has recently been found that Pickering utilized a wide range of dictionaries in the compilation process of the Vocabulary and that he may deserve to be called an originator of comparative lexicography in America. However, there is another historically significant aspect concerning the dictionary. That is, Pickering exercised his ingenuity in providing abundant linguistic information by utilizing quite a few materials other than dictionaries. In this paper, I will deal with this aspect, aiming to further clarify the first stage of American lexicography.
Despite the decreasing use of regional and local varieties of the Dutch language, there is a growing public interest in dialects in the Netherlands and Flanders. Several dialect associations strive to preserve the local dialect by creating lexicons, establishing spelling conventions, writing texts in their local dialect, teaching the dialect, and sharing knowledge about their local dialect with the general public. This work is done by dialect enthusiasts, volunteers, with limited resources and very limited technical support. In 2021, the Dutch Language Union decided to find ways to support this work and asked the Dutch Language Institute to explore the possibility to develop and maintain a lexical infrastructure that could not only help to create lexical resources but also provide means to make the data accessible to dialect users and learners. It was decided to carry out this exploration in the form of a pilot project for Bildts. There is a very active and lively language community in Bildt as evidenced by a report on Frisian-Dutch contact varieties in Friesland, which highlights the desires and requirements of smaller language variations. The pilot should result in an infrastructure that not only meets the requirements of the Bildts language community but also lays a foundation for future infrastructure development for other dialects. Ultimately the intended dialect infrastructure should serve as the primary resource for users of often small language varieties. Providing such an infrastructure will not only streamline the inventory and description of language varieties but also facilitate users’ search for information on words, spelling, and grammar. The requirements for such an infrastructure for Bildts were formulated by a steering group. It was decided to focus on written dialect data and the consensus was that a lexical database was the first priority. The aim: enabling people to learn Bildts.
Lexical Data Editing Environment
The underlying concept is as follows: a list of words in Dutch is paired with their corresponding word in Bildts. The Dutch word list is based on sources such as Hazenberg et al. (1992), meant for language learners at level B1. For Bildts, an existing dictionary (Buwalda et al., 2013) is utilized, with its content immutable. In cases where words are absent from Buwalda, the platform permits the addition of new words, automatically categorizing them as part of a new lexicon (Woordeboek Bildts Aigene (WBA)). This approach enhances both the suitability of the lexical resource for dialectal language production and the comparability of lexical resources across different dialects within the infrastructure. It represents an initial step towards establishing a comprehensive onomasiological resource. The WBA includes an editing environment, with all editing operations being logged for quality control purposes. During the presentation, we will demonstrate the various steps involved in the editing and linking process.
Publication Platform
Following the editing phase, the data will be made available online on the websites of the collaborating organizations. The data will be accessed through a search facility based on the Woordwaark platform, which will be adapted by INT to fit the data and requirements for Bildts. Woordwaark utilizes a PostgreSQL database to store all relevant data. The website is created using JavaScript and Node.js to provide an interactive experience. In addition to the website, it will also be possible to search the dictionaries using a JSON API, which is useful if other websites or services want to integrate with Woordwaark. The entire system is packaged into one or more Docker images, making it relatively easy to deploy in different environments.
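The pairing concept of the editing environment (a read-only existing dictionary, a separate lexicon for newly added words, and a log of every editing operation) can be sketched roughly as follows. This is an illustrative data model only, not the platform's actual implementation: the class, field and method names are hypothetical, and the word strings are placeholders rather than real Dutch or Bildts entries.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from types import MappingProxyType


@dataclass
class LinkingWorkspace:
    """Hypothetical model of linking Dutch words to dialect words."""

    base_dictionary: MappingProxyType  # existing dictionary, read-only
    new_lexicon: dict = field(default_factory=dict)  # words added by editors
    links: dict = field(default_factory=dict)  # Dutch word -> (source, dialect word)
    log: list = field(default_factory=list)  # every editing operation

    def link(self, dutch, dialect_word, editor, gloss=""):
        """Pair a Dutch word with a dialect word and log the operation."""
        if dialect_word in self.base_dictionary:
            source = "base"
        else:
            # Words absent from the base dictionary go to the new lexicon.
            self.new_lexicon.setdefault(dialect_word, gloss)
            source = "new"
        self.links[dutch] = (source, dialect_word)
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "editor": editor,
            "dutch": dutch,
            "dialect_word": dialect_word,
            "source": source,
        })
        return source


# Placeholder entries, not real dictionary content.
base = MappingProxyType({"dialect_word_a": "placeholder gloss"})
workspace = LinkingWorkspace(base_dictionary=base)
first = workspace.link("dutch_word_1", "dialect_word_a", editor="editor_1")
second = workspace.link("dutch_word_2", "dialect_word_b", editor="editor_1")
```

Wrapping the existing dictionary in a read-only mapping mirrors the requirement that its content stays immutable, while the append-only log supports later quality control.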
Documentation
At the end of the pilot phase, extensive documentation will be made available for use by other regional languages interested in adopting the digital infrastructure. This documentation will include a workflow description, a user manual, and a report detailing the pilot experiences. We will identify best practices, necessary partners, and their respective roles and tasks, aiming to provide better assistance to other language varieties seeking to use the infrastructure.
The English language today serves as a crucial medium for knowledge dissemination (Jenkins, 2014), facilitating more global linguistic interactions than any other language (Galloway & Rose, 2015). This extensive reach has significantly influenced lexicography through corpus linguistics, providing evidence of word use both as single units (Coxhead, 2000) and in phraseological combinations, where corpora have had the most transformative impact (Paquot, 2015). Additionally, EAP lexicography (Frankenberg-Garcia et al., 2019; Paquot, 2015; Wanner et al., 2013; Granger & Paquot, 2010) and discipline-specific EAP lexicography (Rees, 2018; 2021) have garnered significant attention in recent years. In the context of EAP pedagogical lexicography, this presentation aims to explore how different semantic categories of nouns followed by that and a complement clause, a construction crucial for conveying stance in academic writing (Biber et al., 1999), are variably used in three academic disciplines: Applied Linguistics, Education, and Psychology. In each of these three disciplines the study specifically targets two distinct subfields: in applied linguistics the focus will be on teaching and learning; in education, the focus will be on instructional science and learning; in psychology, the focus will be on educational and cognitive aspects. By focusing on two subfields within each discipline, as in previous studies (e.g., Hu & Cao, 2015), this research also reveals semantic differences embedded in discourse and rhetoric among subdisciplines within the same academic field. Analysing 535,214 words (Table 1) from research articles in three fields and six subfields in total, the research focuses on the frequency and semantic nuances of the Noun Complement construction in question. This study uses ALPE, a manually compiled corpus in Sketch Engine, for this analysis.
Adopting the methodology pioneered by Francis et al. (1998) and refined by Charles (2007), this study categorizes nouns in detail, segmenting them into various semantic groups. This methodology facilitates a more nuanced analysis by semantically grouping nouns within their thematic or conceptual categories, uncovering deeper patterns and meanings, and offering insights into their contribution to the overall discourse in specific academic fields. Finally, the presentation concludes with recommendations for the lexicographical treatment of noun + that constructions, applicable to dictionary compilation, teaching materials development, online resources, and writing assistant tools, that will prove useful to lexicographers, EAP teachers, and by extension to students, native and non-native alike.
In this paper, we explore the possibilities and challenges of lexicographic treatment of pragmatic markers, specifically epistemic and evidential markers in Czech. Our starting point is a detailed comparison of how these expressions are treated in contemporary monolingual Czech dictionaries. Following this, we present the development of the SEEMLex lexicon of Czech epistemic and evidential markers which is based on detailed annotation of selected expressions using data from a Czech-English parallel corpus. We describe the features we annotate when analysing the expressions studied, outline the main aspects that constitute or distinguish their meanings, and emphasise the importance of considering the communicative function in which these expressions are used. Additionally, we highlight the benefits of using a specialised lexical database for the lexicographic processing of pragmatic expressions in general. We demonstrate our approach with a draft of a dictionary entry for the common Czech epistemic marker asi ‘probably’ providing a comprehensive example of our methodology.
From 2017 until 2021, the Croatian Web Dictionary – Mrežnik was a project of the Croatian Science Foundation; from 2022 until now, it has been an internal project of the Institute for the Croatian Language, and from 2024, it will be funded by the EU program NextGeneration EU. The project goals are to compile an e-dictionary of the Croatian language that is online, free, corpus-based, monolingual, hypertext, searchable, normative, and based on the contemporary results of e-lexicography and computational linguistics. Mrežnik consists of three modules: for adult native speakers of Croatian, schoolchildren, and non-native speakers of Croatian. It will be the central meeting point of the existing language resources of the Institute of Croatian Language and Linguistics but also of all language resources created within the project. Croatian Web Dictionary – Mrežnik is conceived as a dynamic dictionary that will be further compiled and edited even after the end of the NextGeneration EU project, as it is a long-term project of the Institute for the Croatian Language. The Mrežnik project was launched primarily because in 2016, at the time of the project application, Croatia was still one of the countries that did not have an online dictionary of their national language compiled according to the rules of contemporary e-lexicography. The need for extensive scientific research in e-lexicography was also recognized, i.e., getting to know the theory and practice of creating e-dictionaries and the possibilities that new dictionary platforms offer. Mrežnik is compiled taking into account semantic relations and the systematic nature of language. The systematic nature of the dictionary can be seen in almost all areas: accentuation of entry words, the selection and accentuation of forms in the grammatical block, the definition of words that belong to closed grammatical and semantic groups, etc.
The two essential computer tools for compiling this three-module dictionary are Sketch Engine, a corpus query system (loaded with the corpora) to support language analysis, and TLex, a dictionary writing system. Word Sketches are specially adapted to the needs of the project and are based on a developed Sketch Grammar. In 2022, a part of the dictionary (A – F) was exported from TLex to both the web application (https://rjecnik.hr/mreznik/) and the CLARIN European science infrastructure repository (clarin.si repository and the github.com public data management system). The presentation will focus on the corpora and wordlist(s), normative and pragmatic aspects of Mrežnik, micro- and macrostructure of Mrežnik, and the place of grammar in Mrežnik. The fact that Mrežnik is the first gamified Croatian web dictionary and the first dictionary with recorded pronunciation will be stressed. The comparison of the three modules will also be addressed, and it will be shown that the center of all lexicographic decisions was always the user.