Description
Mastering idiomatic language in its broadest sense is necessary to reach an advanced level in language learning, so phraseological information should be quickly and easily available to language learners. To this end, the Dutch project Woordcombinaties (Word Combinations) is developing an integrated lexicographic resource that combines a collocation and idiom dictionary with a pattern dictionary. It merges a word-in-context and collocation tool, following the example of SKELL (Sketch Engine for Language Learning), with a pattern dictionary, following examples such as the Pattern Dictionary of English Verbs (PDEV) and T-PAS for Italian (Ježek et al., 2014).

Woordcombinaties is based on a corpus of about 200 million tokens, primarily newspaper material from both the Netherlands and Belgium, to reflect variation within Dutch. The corpus is syntactically parsed with the Alpino parser (van Noord 2006) and uploaded into Sketch Engine. Pattern editing, which follows Corpus Pattern Analysis (CPA; Hanks 2013), is supported by SKEMA (Sketch Engine Manual Annotation), a specially developed corpus pattern editor (Baisa et al., 2020). Pattern editing is still very much a computer-supported manual task and as such extremely labour-intensive.

In this poster we present experiments exploring ChatGPT’s performance on three pattern-editing tasks: pattern generation, pattern classification, and semantic type annotation. For all tasks we used few-shot prompting, providing ChatGPT-4 with at least two examples as well as clear instructions on the expected output (see Figure 1 for an example prompt). The results were evaluated by comparing them to the manually annotated data.

Figure 1. Example prompt:

You are a lexicographer working on Corpus Pattern Analysis as developed by Patrick Hanks. You know that the Dutch verb ‘boeken’ has 4 patterns defined by these implicatures:
1. iemand schrijft iets in de boekhouding [...]
You have to disambiguate a list of Dutch sentences according to these implicatures, with the output in a json object with a single property “concordances” which is an array containing for each sentence: the sentence and an implicature number. For each sentence, you return the following information:
- sentence: the sentence itself
- implicature: the number of the implicature
according to the following json format for each sentence: {“text”: <sentence>, “implicature”: <implicature_number>}.
Here is an example for the disambiguated output:
{ “concordances”:[ [
{“text”: “Het bedrijf boekte een omzet van 53,78 miljard euro”, “implicature”: “1”},
{“text”: “Ze boeken vlucht en hotel apart, zoeken zelf wel een Airbnb”, “implicature”: “3”},
…]
Now process the following sentences:
We hoeven dus geen hotel meer te boeken als we naar zee willen.
Met de nettowinst van 67 miljoen euro boekt Adecco 40 procent minder winst dan 2012.
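A reply in the format requested by the prompt could be checked before it is compared with the manual annotation. The following is a minimal Python sketch, not the project's own code. The function name `parse_concordances` and the range check are illustrative assumptions. It also assumes the reply is valid JSON with straight quotes and a flat "concordances" array. The sample reply reuses the two labelled sentences from the example in the prompt.

```python
import json


def parse_concordances(reply: str, n_patterns: int) -> list[tuple[str, int]]:
    """Turn a reply in the prompt's JSON format into (sentence, pattern number) pairs.

    Hypothetical helper: assumes a flat "concordances" array of objects
    with the keys "text" and "implicature".
    """
    data = json.loads(reply)
    pairs = []
    for item in data["concordances"]:
        # The example output in the prompt gives the numbers as strings.
        number = int(item["implicature"])
        if not 1 <= number <= n_patterns:
            raise ValueError(f"implicature {number} is out of range for: {item['text']}")
        pairs.append((item["text"], number))
    return pairs


# Sample reply built from the example in the prompt (not actual model output).
reply = """{"concordances": [
  {"text": "Het bedrijf boekte een omzet van 53,78 miljard euro", "implicature": "1"},
  {"text": "Ze boeken vlucht en hotel apart, zoeken zelf wel een Airbnb", "implicature": "3"}
]}"""

pairs = parse_concordances(reply, n_patterns=4)  # 'boeken' has 4 patterns
for sentence, number in pairs:
    print(number, sentence)
```

Rejecting out-of-range numbers early keeps malformed replies from being counted as ordinary misclassifications later on.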
Initial results suggest that, of the three tasks we explored, pattern generation is the most difficult: in the patterns it generates, ChatGPT tends to copy from the example patterns provided in the prompt. ChatGPT also struggles with semantic type annotation, whereas the results for pattern classification are rather promising. The study is ongoing and is expected to give a more comprehensive picture of ChatGPT’s performance in CPA-like analysis.
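For pattern classification, the comparison with the manually annotated data could be scored along the following lines. The abstract does not say which measure was used, so overall accuracy and per-pattern recall are assumptions here. The function name `score` is hypothetical, and the labels are made up for illustration and are not results from the study.

```python
from collections import Counter


def score(predicted: list[int], gold: list[int]) -> tuple[float, dict[int, float]]:
    """Return overall accuracy and per-pattern recall against the manual labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("no sentences to score")
    # Count correct predictions per gold pattern, and gold sentences per pattern.
    hits = Counter(g for p, g in zip(predicted, gold) if p == g)
    totals = Counter(gold)
    accuracy = sum(hits.values()) / len(gold)
    recall = {pattern: hits[pattern] / totals[pattern] for pattern in sorted(totals)}
    return accuracy, recall


# Made-up pattern numbers for four sentences (illustration only).
gold = [1, 3, 3, 2]
predicted = [1, 3, 2, 2]

accuracy, recall = score(predicted, gold)
print(accuracy)  # 0.75
print(recall)    # {1: 1.0, 2: 1.0, 3: 0.5}
```

Per-pattern figures matter because a verb's patterns are rarely equally frequent, so a high overall accuracy can hide poor performance on the rarer patterns.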