Description
This paper discusses how syntagmatic constraints in the lexicon can be presented in lexicographic resources more effectively than has been done to date, covering a wide range of multi-word expressions, from compounds to collocations and phrasemes. Examples are taken from the ongoing implementation of a multilingual specialised resource called ALMA – Multimedia Linguistic Atlas of Bio/cultural Food Diversity (Caruso et al., in press), but the proposed microstructural organisation can also be applied to general language dictionaries.
Multiword expressions and fixed syntagmatic units of the lexicon are key to fluent writing and speaking. Although writing assistants can offer prompt support while a text is being composed, learning lexical constraints remains paramount for real-time interactions. Traditionally, dictionaries that assist in writing have provided synonyms and opposites to help find the most appropriate word to express the writer’s ideas. In recent years, by contrast, there has been a tendency to offer alphabetical lists of the phraseological units associated with the lemma next to or below the article (e.g., CA; COBUILD; De Mauro; LDOCE; OELD). Nevertheless, a more comprehensive representation of the semantic domain may help users formulate statements on specific topics, and an onomasiological arrangement of relevant syntagmatic units may be instrumental in achieving this goal.
For the microstructure of ALMA, principles for sketching an intuitive ontological organisation of lexicographic data have been derived from Pustejovsky’s Qualia (Pustejovsky, 1991; 1995; Pustejovsky & Jezek, 2008; Pustejovsky & Rumshisky, 2008; Pustejovsky et al., 2014). Qualia, the plural form of the Latin interrogative pronoun “quale” (or ‘what’), capture the most salient features of entities denoted by words, positing that human knowledge of objects stems from answering four essential questions about the entity’s (i) class and domain, (ii) constitutive parts, (iii) purpose and function, and (iv) origin:
[i.] Formal quale: What kind of thing is it, what is its nature?
[ii.] Constitutive quale: What is it made of, what are its constituents?
[iii.] Telic quale: What is it for, how does it function?
[iv.] Agentive quale: How did it come into being, what brought it about?
(Pustejovsky & Jezek, 2016).
For example, ‘bread’ has its Origin in ‘kneading’ and ‘being baked’ (see knead and bake in the article, Figure 1), while different actions can be performed to add Constitutive parts, such as condiments or other food. These actions are expressed by verbs such as dip, smear, and top, as illustrated in the article example: smear the bread lavishly with softened butter. In ALMA, the non-technical terms listed above are employed instead of Pustejovsky’s semantic terminology.
The arrangement of syntagmatic units in the microstructure of Figure 1 guides users to find collocates or related words for speaking or writing about bread. For instance, white bread refers to the wheat flour used to bake this type of food, which has a characteristic white colour. The compound therefore appears in the Origin section along with the collocation fresh bread, meaning ‘a bread that has just been baked’. Other collocations, such as the Portuguese pão dormido (lit. ‘sleeping bread’), which means ‘stale bread’, stand in an ‘is a’ relation with bread, reflecting a Class and domain meaning, and are listed accordingly in the article search-zones (Gouws, 2014).
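The grouping described above can be sketched as a small data structure. This is only an illustrative sketch, not ALMA's actual implementation: the names `Quale`, `SyntagmaticUnit`, `group_by_section` and `BREAD_UNITS` are hypothetical, and the section assignments simply follow the bread examples given in the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Quale(Enum):
    """Pustejovsky's four qualia, paired with the non-technical section labels."""
    FORMAL = "Class and domain"
    CONSTITUTIVE = "Constitutive parts"
    TELIC = "Purpose and function"
    AGENTIVE = "Origin"


@dataclass(frozen=True)
class SyntagmaticUnit:
    """Hypothetical record for one collocate, compound or collocation."""
    form: str
    language: str
    quale: Quale
    gloss: str = ""


def group_by_section(units):
    """Arrange units onomasiologically: section label -> list of forms."""
    sections = defaultdict(list)
    for unit in units:
        sections[unit.quale.value].append(unit.form)
    return dict(sections)


# Illustrative data taken from the 'bread' examples discussed in the text.
BREAD_UNITS = [
    SyntagmaticUnit("knead", "en", Quale.AGENTIVE),
    SyntagmaticUnit("bake", "en", Quale.AGENTIVE),
    SyntagmaticUnit("white bread", "en", Quale.AGENTIVE),
    SyntagmaticUnit("fresh bread", "en", Quale.AGENTIVE,
                    "a bread that has just been baked"),
    SyntagmaticUnit("pão dormido", "pt", Quale.FORMAL, "stale bread"),
    SyntagmaticUnit("dip", "en", Quale.CONSTITUTIVE),
    SyntagmaticUnit("smear", "en", Quale.CONSTITUTIVE),
    SyntagmaticUnit("top", "en", Quale.CONSTITUTIVE),
]

if __name__ == "__main__":
    for section, forms in group_by_section(BREAD_UNITS).items():
        print(f"{section}: {', '.join(forms)}")
```

Keying each unit to a single quale keeps the article sections disjoint; a real resource might need to let one unit appear under several sections.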
Pustejovsky’s Qualia also facilitate the explicit portrayal of cultural information encoded in the lexicon, enabling cross-linguistic comparisons in metaphor-rich domains like food. For instance, ‘kneading bread’ is lexicalized in Spanish as amasar and in Portuguese as amassar, meaning ‘compacting the ingredients’, whereas Italian uses the denominal verb impastare, derived from impasto, or ‘dough’. Similarly, in English and Spanish, making bread and hacer pan have synonyms that refer to the instrument used for cooking, such as baking bread [1] and hornear el pan [2]:
[1] Do you bake your own bread?
[2] Muchos italianos han comenzado a hornear sus propios panes para recortar gastos.
The above are figurative units having “an image component […or…] a specific conceptual structure mediating between the lexical structure and the actual meaning. Hence, the content plane […] not only consists of a pure ‘meaning’, i. e. actual sense denoting an entity in the world, but also includes traces of the literal reading underlying the actual meaning” (Dobrovol’skij & Piirainen, 2021, p. 14). The actual meaning and literal reading are formalised according to the Qualia Structure in the computational lexicon of ALMA, using the external ontology of the SIMPLE model (Lenci et al., 2000). To represent the literal reading, syntagmatic units receive a second-level annotation, describing the semantic relationship existing among their elements. For example, pão and dormido are in an “agentive” relationship, as the act of sleeping alludes to the time it takes for bread to become ‘stale’. The annotation of individual components in the multiword is formalised using the OntoLex-Lemon model (McCrae et al., 2017) in addition to the SIMPLE framework.
Such formal structuring allows sophisticated access to the data through tailored queries, facilitating deeper insights into language structure and usage patterns, while also improving the representation of the cognitive mechanisms behind phraseme formation.
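As a rough illustration of the kind of tailored query this two-level annotation makes possible, the sketch below stores a unit-level relation and an optional component-level relation for each multiword. It is plain Python rather than the OntoLex-Lemon or SIMPLE encoding, and every name in it (`AnnotatedUnit`, `find_units`, `UNITS`) is hypothetical. Only pão dormido carries a second-level value, because that is the only internal relation stated in the text; the others are left unannotated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AnnotatedUnit:
    """Hypothetical two-level annotation of a multiword expression."""
    form: str
    language: str
    components: Tuple[str, ...]
    unit_relation: str                # first level: relation of the unit to the lemma
    internal_relation: Optional[str]  # second level: relation among the components
    meaning: str = ""


# Illustrative data; None marks a second-level annotation not given in the text.
UNITS = [
    AnnotatedUnit("pão dormido", "pt", ("pão", "dormido"),
                  "formal", "agentive", "stale bread"),
    AnnotatedUnit("white bread", "en", ("white", "bread"),
                  "agentive", None),
    AnnotatedUnit("fresh bread", "en", ("fresh", "bread"),
                  "agentive", None, "a bread that has just been baked"),
]


def find_units(units, *, unit_relation=None, internal_relation=None):
    """Return the units matching every relation that was specified."""
    matches = []
    for unit in units:
        if unit_relation is not None and unit.unit_relation != unit_relation:
            continue
        if (internal_relation is not None
                and unit.internal_relation != internal_relation):
            continue
        matches.append(unit)
    return matches


if __name__ == "__main__":
    # Units whose literal reading rests on an agentive link between components.
    for unit in find_units(UNITS, internal_relation="agentive"):
        print(unit.form, "->", unit.meaning)
```

Separating the two levels lets a query target the literal reading (how the components relate) independently of the actual meaning (where the unit sits in the article).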
This research therefore aims to provide more valuable and comprehensible data for ‘training’ human learners and more reliable microstructural templates for machine-readable dictionaries that can be used for various NLP tasks, particularly in machine translation systems.