8–12 Oct 2024
Hotel Croatia
Europe/Warsaw timezone

Leveraging Dictionary Look-Up Behaviour to Supplement CEFR Vocabulary Lists

8 Oct 2024, 16:30
30m
Ragusa Hall (Hotel Croatia)

Speakers

Sascha Wolfer, Robert Lew

Description

The look-up behaviour of dictionary users has an established place in lexicographic research (Bergenholtz & Johnson, 2005; Lemnitzer, 2001; Lorentzen & Theilgaard, 2012; Trap-Jensen et al., 2014). It has been used with some success to improve the interaction between the dictionary and its users, for example by discovering users’ typical search patterns, strategies, and errors, and by fine-tuning the dictionary interface to serve users better. In this study, we leverage user look-up frequency in the English Wiktionary. Research so far has identified a robust relationship between lexical frequency and look-up frequency across various languages and dictionaries (De Schryver et al., 2019; Koplenig et al., 2014; Müller-Spitzer et al., 2015). More recently, other factors have been shown to influence look-up frequency: a word’s age-of-acquisition; its polysemy status (whether the word has one sense or several); its prevalence in the speaker population (Lew & Wolfer, 2024); and possibly also its part of speech. Central to this research, our preliminary findings suggest that the CEFR level (Council of Europe, 2020, pp. 36–37) explains additional variance in look-up behaviour, beyond what the other lexical factors already account for. Thus, the research so far shows that dictionary look-ups can be predicted, supplying information that is of great practical utility in compiling new dictionaries, as well as in improving existing dictionaries so they can serve their users better. However, the question we want to tackle here is: can we extract additional insights from the look-up data itself that go beyond lexicography? Or, to paraphrase a famous statement: ask not what you can do for your dictionary; ask what your dictionary can do for you!
A known challenge in language learning is the compilation, updating, and expansion of CEFR-graded vocabulary lists: a task that is highly labour-intensive and has tended to rely on methodologies that are not always transparent, well-documented, or readily replicable. In this connection, our study proposes using dictionary look-up data, alongside a few other relatively easy-to-obtain lexical properties of words, to predict (or impute) the CEFR level of words, as shown in Figure 1. Our research addresses the following three fundamental questions:

1. Are classification algorithms able to predict CEFR levels with an accuracy higher than a random baseline (and if so, how much higher)?
2. Which specific algorithm performs best on this task?
3. How much agreement is there among different algorithms in categorizing words into CEFR levels?

Our main goal is to develop and test an automated and reliable process for generating lists of candidate words that could then be fruitfully added to existing CEFR lists. When set to predict candidates for three broad levels (A, B, C), and using the English Vocabulary Profile (Cambridge University Press, 2015; Capel, 2012, 2015) for training, our best-performing models (Regression Trees, Ordinal Logistic Regression, and Random Forests) returned the following words among candidates for level A: become, true, mom/ma, daddy, and bitch. For level B, we obtained, among others: whip, clay, chamber, commission, and deed. Some of the candidates appear worthy of inclusion, though probably not all: common swear words, for instance, are generally deemed inappropriate for educational contexts. The proposed method may offer a convenient way to update CEFR vocabulary lists, but its feasibility crucially depends on a number of conditions. For the algorithms to be trained effectively, a substantial pre-compiled CEFR list is required as a starting point: our current estimate puts its minimum size at a few thousand items.
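The first and third research questions can be made concrete with a small evaluation sketch. This is not the authors' procedure. It assumes that the "random baseline" means guessing levels in proportion to their share of the reference list, and it uses Cohen's kappa as one possible measure of agreement between algorithms. The labels and the model names `tree`, `ordinal`, and `forest` are hypothetical toy data, not results from the study.

```python
from collections import Counter
from itertools import combinations


def accuracy(gold, pred):
    """Share of words whose predicted level matches the reference level."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chance_accuracy(gold):
    """Expected accuracy of guessing levels at random in proportion to
    their frequency in the reference list (sum of squared class shares).
    Assumption: this is what 'random baseline' refers to."""
    n = len(gold)
    return sum((count / n) ** 2 for count in Counter(gold).values())


def cohen_kappa(a, b):
    """Chance-corrected agreement between two sequences of labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both sequences use one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical reference levels and model outputs for eight words.
gold = ["A", "A", "A", "B", "B", "C", "A", "B"]
predictions = {
    "tree":    ["A", "A", "B", "B", "B", "C", "A", "A"],
    "ordinal": ["A", "A", "B", "B", "B", "B", "A", "B"],
    "forest":  ["A", "A", "A", "B", "C", "C", "A", "B"],
}

# Question 1: how far above the baseline is each model?
baseline = chance_accuracy(gold)
for name, pred in predictions.items():
    acc = accuracy(gold, pred)
    print(f"{name}: accuracy {acc:.3f}, gain over chance {acc - baseline:+.3f}")

# Question 3: how much do the models agree with one another?
for (name_a, pred_a), (name_b, pred_b) in combinations(predictions.items(), 2):
    print(f"{name_a} vs {name_b}: kappa {cohen_kappa(pred_a, pred_b):.3f}")
```

Because the CEFR levels are ordered, a weighted kappa or another ordinal agreement measure may be a better fit in practice. The unweighted version is used here only to keep the sketch short.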
The selection of predictors is another crucial aspect, and our study offers an initial set of predictors that may be refined in future research. Finally, the success of our approach depends on the availability of open dictionary data, including metrics such as page views, number of senses, and part-of-speech information.
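As an illustration of how the dictionary-derived metrics named above could be turned into classifier input, here is a minimal sketch. Everything in it is an assumption made for the example: the `Entry` record, the `POS_TAGS` inventory, the log transform of the view count, and the values for "clay" are hypothetical and do not describe the study's actual feature set.

```python
import math
from dataclasses import dataclass

# Assumed part-of-speech inventory; anything else is mapped to "other".
POS_TAGS = ("noun", "verb", "adjective", "adverb", "other")


@dataclass
class Entry:
    word: str
    views: int      # page views of the dictionary entry (look-up frequency)
    n_senses: int   # number of senses listed for the word
    pos: str        # part of speech


def to_features(entry):
    """Encode one dictionary entry as a flat numeric feature vector."""
    pos = entry.pos if entry.pos in POS_TAGS else "other"
    one_hot = [1.0 if pos == tag else 0.0 for tag in POS_TAGS]
    return [
        math.log10(entry.views + 1),         # dampen the skew of view counts
        float(entry.n_senses),
        1.0 if entry.n_senses > 1 else 0.0,  # polysemy flag
    ] + one_hot


# Hypothetical values, for illustration only.
example = Entry(word="clay", views=999, n_senses=3, pos="noun")
print(to_features(example))
```

Other predictors mentioned in the description, such as lexical frequency, age-of-acquisition, or prevalence, would be appended to the same vector in the same way.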
