Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

Up to No Good: Exploiting Word Embeddings for an Automatic Extraction of Candidates for a Lexicon of Slovene Taboo Language

Nov 19, 2025, 12:00 PM
1h
Lobby

Speaker

Jaka Čibej

Description

POSTER

Lexicons of taboo language are useful language resources that serve multiple purposes. Beyond their direct use in automatically censoring words deemed inappropriate for a given context (e.g. to help mitigate the problem of online hate speech), they can help filter out materials not suitable for educational purposes (see Zingano Kuhn et al., 2022), for games with a purpose (Arhar Holdt et al., 2021), or for training general language models (e.g. to remove pornographic content from training data). In addition, taboo language, particularly the part related to hate speech, needs to be well documented in dictionaries, as these are used as authoritative language resources (Gorjanc, 2005). Taboo language lexicons can also be useful for linguistic analyses and contrastive translation studies, since swearing and taboo language are frequently culturally specific; see e.g. Klemenčič (2016) for a contrastive study of swearing in Slovene and Swedish. That study, however, focused on a limited set of hand-picked expressions, since no comprehensive list yet exists for Slovene, at least not in a machine-readable format.

What is included in existing Slovene language resources is either not openly accessible, inaccurately represented (e.g. with pejorative as the only label, even though the context can differ radically in terms of intensity or tabooness: cf. bedak 'fool' vs. peder 'faggot'), or limited in scope (Thesaurus of Modern Slovene; Krek et al., 2023), with material stemming mostly from corpora of standard Slovene, where the usage of offensive vocabulary is limited.

While similar lexicons have been compiled from existing language resources (e.g. van Huyssteen & Tiberius, 2023), we present an approach for constructing a list of Slovene taboo language candidates using FastText embeddings trained on a number of Slovene corpora (including web crawls). We first extract seed entries from the Thesaurus of Modern Slovene 2.0 (Krek et al., 2023), which is part of the Digital Dictionary Database of Slovene (DDDS; Kosem et al., 2021), selecting entries in which at least one of the senses has been assigned a relevant label (hate speech, vulgar/coarse, expresses a negative attitude; see Arhar Holdt et al., 2022). We group them manually (e.g. religion-based, race-based, gender-based, homophobic slurs, words with sexual connotation), then compare their embeddings (Terčon et al., 2023) with the embeddings of other words using cosine similarity to obtain a list of candidate similar words.
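The candidate extraction step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `rank_candidates`, the choice to score each word by its highest similarity to any seed, the threshold of 0.5, and the three-dimensional toy vectors with placeholder tokens are all assumptions made for illustration. In practice, the vectors would come from the pretrained FastText model and the seeds from one manually defined group.

```python
import numpy as np


def rank_candidates(embeddings, seed_words, threshold=0.5):
    """Rank non-seed words by their highest cosine similarity to any seed.

    embeddings: dict mapping word -> vector (hypothetical stand-in for
                vectors loaded from a FastText model).
    seed_words: words of one manually defined seed group.
    threshold:  assumed cut-off; the abstract does not specify one.
    """
    seed_set = set(seed_words)
    seeds = [w for w in seed_words if w in embeddings]
    others = [w for w in embeddings if w not in seed_set]
    if not seeds or not others:
        return []

    def unit_rows(words):
        # Stack the vectors and scale each row to unit length, so that
        # a dot product between rows equals their cosine similarity.
        matrix = np.array([embeddings[w] for w in words], dtype=float)
        return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    # sims[i, j] = cosine similarity between others[i] and seeds[j]
    sims = unit_rows(others) @ unit_rows(seeds).T
    best = sims.max(axis=1)  # assumption: keep the best match per word

    ranked = sorted(zip(others, best), key=lambda pair: pair[1], reverse=True)
    return [(word, float(score)) for word, score in ranked if score >= threshold]


# Toy data with placeholder tokens; these are not real embeddings.
toy_embeddings = {
    "seed_a": [1.0, 0.0, 0.0],
    "seed_b": [0.9, 0.1, 0.0],
    "cand_x": [0.8, 0.2, 0.0],
    "cand_y": [0.7, 0.0, 0.3],
    "neutral": [0.0, 0.0, 1.0],
}

for word, score in rank_candidates(toy_embeddings, ["seed_a", "seed_b"]):
    print(f"{word}\t{score:.3f}")
```

Taking the maximum over seeds favours words that closely resemble at least one seed; averaging over the group would instead favour words close to the group as a whole. Either way, the output is a list of candidates that still requires manual review.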

We discuss the results of this extraction as well as the advantages of this approach (e.g. the detection of non-standard words or words that are rare in the corpus and might not be detected through a frequency-based approach) and its disadvantages (e.g. it focuses on single-word expressions and is lexeme-focused instead of sense-focused). The resulting lexicon will be made available under an open-access license (CC BY-SA 4.0) and will also be included in the Sloleks Morphological Lexicon of Slovene (Čibej et al., 2022), which is part of the DDDS. The lexicon can provide a basis for a more detailed lexicographic analysis within DDDS, and the method can be applied to other languages.
