Description
We present a demo of MWE-Finder, an application that enables a user to search for (flexible) multiword expressions (MWEs) in Dutch text corpora (Odijk et al., 2024). The demo will cover many different examples; here we walk through one.
A multiword expression (MWE) is a word combination with linguistic properties that cannot be predicted from the properties of the individual words or the way they have been combined by the rules of grammar (Odijk, 2013). Many MWEs in Dutch are flexible in the sense that the component words can occur in multiple forms, are not necessarily adjacent, and do not always occur in the same order. This makes it difficult to search for such MWEs with most existing query systems, but MWE-Finder has been specifically made to deal with this.
The targeted users of MWE-Finder are linguists and lexicographers who want to investigate the properties of Dutch MWEs. A user enters an MWE in canonical form. A canonical form is a unique form that represents a set of linguistic objects, in this case the set of variants of the MWE (different component forms, different word orders, etc.). The notion of canonical form for MWEs is defined in Odijk & Kroon (2024). Users can enter their own MWE in canonical form or select one of the more than 11,000 canonical forms that MWE-Finder offers. These canonical forms are mostly based on the native-speaker intuitions of the creator of the resource.
As our example (1), we take the canonical form of the MWE de dans ontspringen. This canonical form can be seen as a hypothesis about the properties of the MWE: it states that the word dans ‘dance’ cannot be modified and that it must be accompanied by the determiner de ‘the’.
We call this MWE the target MWE. When it is entered, MWE-Finder automatically generates three queries:
MWE Query (MEQ): searches for the MWE;
Near Miss Query (NMQ): searches for the content words of the MWE in the grammatical configuration they have in the MWE;
Major Lemma Query (MLQ): searches for the content words of the MWE, ignoring the grammatical configuration.
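The difference in strictness between the three queries can be illustrated with a toy sketch in Python. It does not show how MWE-Finder is implemented: the representation of a sentence as a set of (head lemma, relation, dependent lemma) triples, the relation labels, and the names ToyMWE, meq, nmq and mlq are all assumptions made for this illustration.

```python
from dataclasses import dataclass

# Hypothetical toy representation: one dependency edge per triple.
Edge = tuple[str, str, str]  # (head lemma, relation, dependent lemma)


@dataclass(frozen=True)
class ToyMWE:
    content_lemmas: frozenset   # lemmas of the content words
    content_edges: frozenset    # configuration among the content words
    fixed_edges: frozenset      # other required components, e.g. a determiner
    unmodifiable: frozenset     # lemmas that may not take extra dependents


def lemmas(sentence: set) -> set:
    """All lemmas occurring in the sentence, as head or as dependent."""
    return {lemma for head, _, dep in sentence for lemma in (head, dep)}


def mlq(mwe: ToyMWE, sentence: set) -> bool:
    """Content words occur somewhere; configuration is ignored."""
    return mwe.content_lemmas <= lemmas(sentence)


def nmq(mwe: ToyMWE, sentence: set) -> bool:
    """Content words occur in the configuration they have in the MWE."""
    return mwe.content_edges <= sentence


def meq(mwe: ToyMWE, sentence: set) -> bool:
    """All components occur as specified, with no forbidden modification."""
    required = mwe.content_edges | mwe.fixed_edges
    if not required <= sentence:
        return False
    extras = sentence - required
    return not any(head in mwe.unmodifiable for head, _, _ in extras)


# Toy encoding of the hypothesis behind example (1); labels are illustrative.
target = ToyMWE(
    content_lemmas=frozenset({"dans", "ontspringen"}),
    content_edges=frozenset({("ontspringen", "obj", "dans")}),
    fixed_edges=frozenset({("dans", "det", "de")}),
    unmodifiable=frozenset({"dans"}),
)

exact = {("ontspringen", "obj", "dans"), ("dans", "det", "de"),
         ("ontspringen", "su", "hij")}
modified = {("ontspringen", "obj", "dans"), ("dans", "det", "die"),
            ("dans", "mod", "juridisch")}
scattered = {("zien", "obj", "dans"), ("willen", "vc", "ontspringen")}

for name, sent in [("exact", exact), ("modified", modified),
                   ("scattered", scattered)]:
    print(name, meq(target, sent), nmq(target, sent), mlq(target, sent))
```

In this sketch every sentence matched by meq is also matched by nmq, and every sentence matched by nmq is also matched by mlq, which mirrors the decreasing strictness of the three queries.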
These queries are progressively less strict. The user can now select the treebank to search in. MWE-Finder offers many treebanks, and users can also upload their own corpora, which are turned into treebanks and made available for search. For this example, we select the Mediargus treebank.
The results of the queries are presented to the user on the screen as they come in. For the MWE in (1), the MEQ finds 1158 hits in over 103 million sentences. The NMQ finds 1271 hits. If we exclude the results of the MEQ (an option that MWE-Finder offers), the 113 remaining hits quickly show that the target MWE occurs in variants not predicted by the canonical form that we started with, because the word dans can occur with a variety of modifiers and determiners.
This suggests that the canonical form that we started with was too strict: we must allow for modification of the MWE component dans, and the article de is not itself a component of the MWE. A better canonical form would be iemand zal dd:[de] *dans ontspringen, in which the code dd:[..] surrounding de indicates that de can be replaced by other definite determiners, and the * before dans means that dans can be modified. In this way, we can improve upon an initial canonical form based mainly on native-speaker intuitions by systematically taking corpus data into account. MWE-Finder makes this possible in a very efficient and user-friendly way.
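To make the two annotations in this canonical form concrete, the following Python sketch reads them off token by token. It is only an illustration: it covers just the dd:[..] and * codes described above, assumes that components are separated by whitespace, and the names Component and parse_canonical_form are hypothetical. The full notation is defined in Odijk & Kroon (2024) and is not reproduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class Component:
    word: str
    modifiable: bool = False        # token was marked with a leading '*'
    any_definite_det: bool = False  # token was wrapped in dd:[...]


# Matches a whole token of the form dd:[word]
DD_PATTERN = re.compile(r"^dd:\[(.+)\]$")


def parse_canonical_form(form: str) -> list:
    """Split a canonical form on whitespace and decode the two annotations."""
    components = []
    for token in form.split():
        dd_match = DD_PATTERN.match(token)
        if dd_match:
            components.append(Component(dd_match.group(1), any_definite_det=True))
        elif token.startswith("*") and len(token) > 1:
            components.append(Component(token[1:], modifiable=True))
        else:
            components.append(Component(token))
    return components


for component in parse_canonical_form("iemand zal dd:[de] *dans ontspringen"):
    print(component)
```

Under these assumptions, de is flagged as replaceable by other definite determiners and dans as modifiable, while iemand, zal and ontspringen carry no annotation.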
The MLQ finds 1309 hits. If we exclude the results of the NMQ, we have to inspect 38 examples. These are mostly valid instances of the MWE de dans ontspringen that have been wrongly parsed, but we also find a variant of the MWE, viz. (2), for which we can now add a canonical form to our lexicon of MWEs in canonical form.
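Excluding the results of a stricter query amounts to a set difference over hits. The Python sketch below illustrates this with made-up sentence identifiers; the function name exclude is hypothetical, and the sketch assumes that every NMQ hit is also an MLQ hit, which is consistent with the counts reported above (1309 - 1271 = 38).

```python
def exclude(less_strict_hits: set, stricter_hits: set) -> set:
    """Hits of the less strict query that the stricter query did not find."""
    return less_strict_hits - stricter_hits


# Made-up sentence identifiers, for illustration only.
nmq_hits = {"s1", "s2", "s3", "s4"}
mlq_hits = nmq_hits | {"s5", "s6"}

remaining = exclude(mlq_hits, nmq_hits)
print(sorted(remaining))

# With the reported totals, and assuming NMQ hits are a subset of MLQ hits:
mlq_total, nmq_total = 1309, 1271
to_inspect = mlq_total - nmq_total
print(to_inspect)
```

The remaining hits are exactly the ones worth inspecting by hand, since they contain the content words of the MWE outside the expected grammatical configuration.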
In this way, a linguist or lexicographer can easily and efficiently investigate the properties of Dutch MWEs, and improve the description of Dutch MWEs.