Speakers
Description
ONLINE PRESENTATION
Taboo-language resources remain scarce for under-resourced languages such as Afrikaans, despite their clear relevance for natural language processing (NLP) and applications in artificial intelligence (AI). Although Afrikaans has a long-standing lexicographic tradition, it still lacks an open-access, reusable lexical database of taboo language. One of the most crucial steps in developing a constructional database for taboo language is to identify a candidate list of taboo constructions for potential lexicographic treatment. This paper outlines and tests a range of procedures for compiling and refining such a list, with the goal of establishing a replicable methodology for similar work in other under-resourced languages. The methods draw on existing data of different types and on corpora representing different registers. However, many of the resulting candidate entries are false positives or ambiguous and therefore require validation. Hence, we experiment with various semi-automated modelling techniques: refining the candidate list through corpus frequency analyses, expanding the list through partial corpus matching, and comparing the results against an attested, verified subset of taboo terms.
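
The three techniques named above can be illustrated with a minimal Python sketch. It is not the procedure used in the paper: the function names, the `min_count` and `min_len` thresholds, the substring-based notion of partial matching, and the toy data (ordinary Afrikaans words standing in for real candidates) are all assumptions made for illustration only.

```python
from collections import Counter


def refine_by_frequency(candidates, tokens, min_count=2):
    """Keep candidates attested at least min_count times (assumed threshold)."""
    counts = Counter(tokens)
    return {c for c in candidates if counts[c] >= min_count}


def expand_by_partial_match(seeds, tokens, min_len=4):
    """Add corpus tokens that contain a seed as a substring (assumed matching rule)."""
    vocab = set(tokens)
    return {
        t
        for t in vocab
        for s in seeds
        if len(s) >= min_len and s in t and t != s
    }


def precision_recall(predicted, verified):
    """Compare a predicted list against a verified subset."""
    true_positives = len(predicted & verified)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(verified) if verified else 0.0
    return precision, recall


# Toy corpus and lists: hypothetical placeholders, not data from the paper.
tokens = "vloek vloek vloekwoord skel skel skelwoord tafel tafel boom".split()
candidates = {"vloek", "skel", "tafel", "wolk"}
verified = {"vloek", "skel", "vloekwoord"}

kept = refine_by_frequency(candidates, tokens)            # drops unattested "wolk"
expanded = kept | expand_by_partial_match(kept, tokens)   # adds compound forms
precision, recall = precision_recall(expanded, verified)
print(sorted(expanded), precision, recall)
```

In this toy run the frequency filter removes an unattested candidate but keeps a frequent false positive ("tafel"), and substring expansion adds one verified and one unverified form, which is the kind of noise that comparison against a verified subset is meant to expose.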