8–12 Oct 2024
Hotel Croatia
Europe/Warsaw timezone

Representativeness and Balance in Multilingual Comparable Corpora for Specialized Lexicography: Revisiting Available Standards and Measures

11 Oct 2024, 09:00
30m
Šipun Hall (Hotel Croatia)

Speakers

Anna Beatriz Dimas Furtado, Elisa Duarte Teixeira

Description

A common issue in Corpus Linguistics is assessing the representativeness and balance of a corpus (McEnery & Hardie, 2011). Biber (1993, p. 244) defines representativeness as “the extent to which a sample includes the full range of variability in a population.” Assessment has traditionally been tackled both quantitatively and qualitatively, in monolingual and bilingual settings alike (Stefanowitsch, 2020). However, the issue is often overlooked and far from solved, especially in specialized multilingual lexicography, where cultural and linguistic differences add an extra layer of difficulty to assessing how representative and balanced a multilingual comparable corpus is.
A further complication is the unequal global status and presence of different languages on and off the Internet. When it comes to balance, which has traditionally been associated with proportionality (Leech, 2007), more difficulties arise. Proportionality can be a tricky concept, as it can refer to multiple aspects of a corpus: the same genres in all corpora, the same number of tokens or texts, or the same sources. In all of these dimensions, it is assumed that textual typologies are equally available and have a similar status across domains in all languages and cultures covered by the subcorpora, which does not necessarily hold true. The issue can be even more pressing in highly interdisciplinary domains, such as migration and asylum, in which differences in national legal systems, migratory flows, and funding can shape the production and availability of supporting materials for asylum officers and claimants, for example. From that, another challenge arises: determining the authenticity of texts, especially when they are produced in multilingual international institutions, where the authorship of texts and their translation(s) can be difficult to establish. When a corpus is built from material available on the Internet (Seghiri, 2011), a further issue is the presence of each language on the web and how far that presence reflects, and is therefore representative of, the offline world. Another complicating factor is the power balance among the languages of the world: English dominates the Internet and has more financial power behind it to foster written language production, while other languages are unable to compete (Prado, 2012).
In the field of migration, the use of incorrect terminology can lead to misunderstanding and even culminate in refusal of entry. Therefore, encoding information such as regional variation in both the corpus and the lexicographic materials is not only desirable but necessary to represent the domain properly. In this study, we discuss the representativeness and balance of the Multilingual Corpus on Migration and Asylum (COMMIRE) (Furtado & Teixeira, 2022), a specialized comparable corpus of Portuguese, French, Spanish, and English with over 1 million words in each language. COMMIRE is a multi-genre, multilingual, and multi-variety corpus whose primary goal is to serve as the basis for the Multilingual Glossary on Migration and Asylum (Furtado, 2019).
Traditionally, representativeness in specialized language corpora has been tackled qualitatively, by collecting the most reputable and crucial texts in a domain on the basis of expert opinion or existing knowledge taxonomies. Although effective, the method is costly and, when multiple languages are involved, the experts are not always easy to identify or reach. An alternative has been to collect as much text as possible, which makes it hard to judge when enough is enough and compounds the challenges of cleaning, preprocessing, storing, and analyzing the full corpus. To mitigate this issue, automatic metrics such as the type-token ratio (TTR) have been proposed to evaluate the balance of corpora (Seghiri, 2014).
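As a minimal illustration of the metric mentioned above, the sketch below computes the type-token ratio as the number of distinct word forms divided by the total number of tokens, and tracks it over growing prefixes of a corpus. The tokenizer and the function names (`tokenize`, `type_token_ratio`, `ttr_curve`) are assumptions made for this example and are not taken from the study or from any particular tool.

```python
import re


def tokenize(text):
    """Lowercase the text and extract runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(tokens):
    """Distinct word forms (types) divided by running words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def ttr_curve(tokens, step):
    """TTR measured on growing prefixes of the corpus, every `step` tokens."""
    return [
        (n, type_token_ratio(tokens[:n]))
        for n in range(step, len(tokens) + 1, step)
    ]


tokens = tokenize("The asylum seeker asked the officer about asylum")
print(type_token_ratio(tokens))  # 6 types over 8 tokens
print(ttr_curve(tokens, 4))
```

Because TTR tends to fall as a text grows, values are only comparable across subcorpora of similar size; following the curve as material is added, rather than reading a single figure, is one way to see whether new texts still contribute new vocabulary.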
While the provision of corpora for multiple domains has increased over time, few studies touch upon extending existing specialized, multilingual, multi-genre comparable corpora for lexicographic purposes. The benefits of exploiting comparable corpora have been consistently demonstrated in the literature (McEnery & Hardie, 2012; Stefanowitsch, 2020; and many more), while multilingual lexicography has long been neglected because of the challenge of creating a unified lexicographic project that encompasses rather heterogeneous nuances. Indeed, it is unthinkable nowadays to build up-to-date lexicographic resources without a corpus; yet building and expanding specialized multilingual monitor corpora remains a challenge. This study is a first step toward addressing this issue and suggests a path that may be useful for other multilingual resources.
In this ongoing case study, we hypothesize that representativeness and balance for multilingual lexicography should first be pursued in one of the languages of the corpus, chosen as the starting language. Then, a comparison of keyword lists across languages might be a good initial method for assessing whether the corpus is ready to serve as a sound source of equivalents, as suggested by Tagnin and Teixeira (2008). Finally, the process is repeated taking each language in turn as the starting one. This means, as these authors suggest, that the entry list for each language might be slightly different at the end, reflecting linguistic and cultural differences.
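The abstract does not say how the keyword lists are produced. As a sketch under that caveat, the example below ranks the words of one subcorpus against a reference corpus using the log-likelihood keyness statistic, a common choice in corpus linguistics. The statistic, the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word with frequency `a` in a study corpus
    of `c` tokens and frequency `b` in a reference corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens, top_n=10):
    """Words relatively more frequent in the study corpus, ranked by keyness."""
    if not study_tokens or not reference_tokens:
        return []
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]  # Counter returns 0 for unseen words
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy data: a specialized sample against a general reference.
study = ["asylum"] * 3 + ["the"] * 2
reference = ["the"] * 4 + ["house"]
print(keywords(study, reference))
```

Running the same procedure once per language yields one keyword list per subcorpus. Aligning those lists across languages still depends on identifying equivalents, which this sketch does not attempt.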
