Description
Good dictionary examples are hard to come by. Despite corpora growing larger and larger, lexicographers still have difficulties in finding good candidate sentences for exemplifying how dictionary headwords are used in context. Automatic methods are available to address this time-consuming task. One such method is GDEX, a feature of the Sketch Engine tool (Kilgarriff et al., 2004), which we have been using in our lexicographic projects and for which we have developed language-specific and even project-specific configurations in recent years. GDEX works on the principle of ranking (randomly selected) corpus sentences according to heuristics defined in a configuration file. This means that it can efficiently separate the wheat from the chaff using criteria such as (preferred) sentence length, forbidden or preferred lemmas or word forms, etc. However, in many cases, numerous sentences remain with (the) high(est) GDEX scores, and many of those turn out to be problematic, which is an issue since typically only the top X sentences per headword or sense are used for dictionary purposes.
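To illustrate the ranking principle, the sketch below scores candidate sentences with a few hand-written heuristics. It is not Sketch Engine's actual GDEX implementation; the criteria, weights, and function names here are invented purely for illustration.

```python
# Illustrative GDEX-style scoring (not Sketch Engine's actual algorithm;
# the thresholds and penalties below are invented for clarity).

def gdex_score(sentence: str,
               forbidden_lemmas: frozenset = frozenset(),
               preferred_length: range = range(10, 26)) -> float:
    """Score a candidate sentence; higher is better."""
    tokens = sentence.split()
    score = 1.0
    # Penalise sentences outside the preferred length window.
    if len(tokens) not in preferred_length:
        score -= 0.5
    # Penalise forbidden vocabulary (here: a simple word blocklist).
    if any(t.lower().strip('.,!?') in forbidden_lemmas for t in tokens):
        score -= 0.7
    # Prefer whole sentences: capital letter at the start, final punctuation.
    if not (sentence[:1].isupper() and sentence.rstrip().endswith(('.', '!', '?'))):
        score -= 0.3
    return score

candidates = [
    "The cat sat quietly on the old wooden chair near the window yesterday.",
    "cat",
]
# Rank candidates from best to worst score.
ranked = sorted(candidates, key=gdex_score, reverse=True)
```

Real GDEX configurations combine many more such classifiers, each with its own weight, but the separation of "wheat from chaff" follows this same score-and-sort pattern.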
The recent arrival of large language models, in particular ChatGPT, has taken the (lexicographic) world by storm, with many lexicographers advocating the use of ChatGPT for tasks such as definition writing, sense division, and even the production of entire entries (Lew, 2023; de Schryver, 2023). We therefore wanted to explore whether ChatGPT could also be used for the task of example selection.
We prepared a task in which the same ChatGPT prompt in English was used on 40,000 sentences extracted from corpora of Brazilian Portuguese, Dutch, Estonian, and Slovenian. The sentences had originally been extracted to prepare manually annotated datasets for teaching and learning purposes as part of the CLARIN Resource Families project (Zingano Kuhn et al., 2022). Each dataset consists of full-sentence examples for 100 lemmas (100 examples per lemma) from three different groups: a) 20 lemmas that are offensive or vulgar, b) 60 lemmas that are offensive, vulgar or sensitive in one of their meanings, and c) 20 lemmas that would typically not be considered offensive, vulgar or sensitive. The lemmas were originally selected in English and then translated into the four languages; in the selection process, certain lemmas were discarded if found problematic from the perspective of comparability, e.g. due to different levels of polysemy in the target languages, several possible translations, etc. In the annotation, the examples were marked as problematic or non-problematic, with the problematic ones also annotated with the type of problem: offensive content, vulgar content, sensitive content, incorrect spelling and/or grammar, and lack of context. We also replicated the annotation with ChatGPT-4 (Kosem et al., 2024). For the example selection task, the following prompt was given to ChatGPT-4:
Below are sentences for the <language> word “<word>”. Select the best 5 sentences that can be used in a general dictionary of <language>.
SENTENCES
<list of sentences>
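Filling the template above for each lemma can be sketched as follows. The function and variable names, as well as the numbering of the sentence list, are our own assumptions, not details taken from the study setup.

```python
# Sketch of filling the prompt template per lemma (names and the
# numbered-list formatting are our assumptions, not from the study).

PROMPT_TEMPLATE = (
    'Below are sentences for the {language} word "{word}". '
    'Select the best 5 sentences that can be used in a general '
    'dictionary of {language}.\n\nSENTENCES\n{sentences}'
)

def build_prompt(language: str, word: str, sentences: list) -> str:
    """Return the complete prompt text for one lemma."""
    listed = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, 1))
    return PROMPT_TEMPLATE.format(language=language, word=word,
                                  sentences=listed)

prompt = build_prompt("Slovenian", "miza",
                      ["Miza je lesena.", "Na mizi je knjiga."])
```

The same template is reused across all four languages, with only the language name, headword, and sentence list substituted.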
In the first step of the evaluation, we compared the selected examples with the results of the manual annotation. Overall, the percentages of sentences selected by ChatGPT that manual annotators had deemed non-problematic were: 70% for Estonian, 54% for Slovenian, 50% for Brazilian Portuguese, and 40% for Dutch. Of the selected examples deemed problematic by manual annotators, most were marked for sensitive content, followed by lack of context. The ratios of problem categories were similar across languages, with some exceptions (e.g., 50% or more of the problematic examples were marked for sensitive content in all the languages except Brazilian Portuguese). The analysis across groups of lemmas showed that, in general, the number of problematic examples per lemma drops from group a to group c.
Next, we compared the selected examples with the results of annotation by ChatGPT, using the same categories as in the manual annotation. The percentages of examples selected by ChatGPT that ChatGPT itself also deemed non-problematic were 39% for Estonian, 36% for Slovenian and Brazilian Portuguese, and 33% for Dutch. In terms of categories, sensitive content and lack of context were again the most common, but many examples were also marked by ChatGPT as containing offensive content (30% of problematic examples for Estonian, 26% for Dutch, and 25% for Slovenian and Brazilian Portuguese).
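The evaluation metric in both comparisons is simply the share of selected examples not flagged as problematic. A toy illustration, with invented example IDs rather than data from the study:

```python
# Toy illustration of the evaluation metric: the percentage of
# ChatGPT-selected examples that an annotation round did NOT flag as
# problematic (the IDs below are invented, not study data).

def non_problematic_share(selected_ids: set, problematic_ids: set) -> float:
    """Percentage of selected examples absent from the problematic set."""
    ok = [i for i in selected_ids if i not in problematic_ids]
    return 100 * len(ok) / len(selected_ids)

# E.g. 5 examples selected for a lemma, 2 of them flagged problematic.
share = non_problematic_share({1, 2, 3, 4, 5}, {2, 5})
```

Aggregated over all 100 lemmas per language, this is the figure reported above (70% for Estonian against the manual annotation, etc.).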
While these results may seem discouraging, it should be noted that the original annotation focused more on the pedagogical value of examples than on their lexicographic value. To illustrate: while language teachers may prefer to exclude all offensive and vulgar words from their materials, lexicographers have to find good examples for all types of vocabulary. Thus, we are now conducting a detailed lexicographic evaluation of the examples selected by ChatGPT. We will reannotate all 500 selected examples to determine whether they are suitable for dictionary purposes. In addition, we will look at all the examples for each lemma to determine whether more suitable dictionary examples can be found, and to see whether the dataset even contains five good example candidates per lemma. The results of these analyses will be presented at the conference, where we will also provide our conclusions and recommendations on using ChatGPT for dictionary example selection.