Description
The Vienna Corpus of Arabic Varieties (VICAV) is a digital research infrastructure for the documentation and analysis of the linguistic diversity of Arabic varieties. Integrating methods from language technology and the digital humanities, VICAV provides a modular, sustainable platform for the creation, management, and publication of heterogeneous language resources within a shared data architecture (Budin et al. 2012; Moerth et al. 2015). At its core lies a commitment to openness, interoperability, and adherence to community standards, in particular the Guidelines of the Text Encoding Initiative (TEI Consortium 2025). Through a text-centered, standards-based design, VICAV enables the representation of diverse types of data, including an extensive bibliography, linguistic profiles, sample texts, and digital dictionaries, within a unified technical framework and a user-friendly web application (https://vicav.acdh.oeaw.ac.at).
Among VICAV’s key components are dictionaries of four Arabic varieties (Baghdad, Cairo, Damascus, Tunis) alongside a dictionary of Modern Standard Arabic, which mainly serves as a point of reference for the others (Procházka & Moerth 2015). These compact lexical databases, containing up to 8,000 entries each, provide structured lexicographic information enriched with English translations and, in some cases, German, French, or Spanish translations as well. All are built on a shared TEI-based model that ensures consistent encoding and comparability across varieties.
The newest addition to the VICAV family of lexicographic resources is the SHAWI Dictionary, developed within the SHAWI Project (The Shawi-type Arabic dialects spoken in South-eastern Anatolia and the Middle Euphrates region, FWF P-33574, 2021–2027). The project investigates the varieties spoken by Bedouin communities in Turkey, Syria, Lebanon, and Iraq, which have so far received little systematic attention in linguistic research. These dialects display internal variation that is distributed along both geographic and sociolinguistic lines, dimensions that require fine-grained modelling beyond the capabilities of standard TEI constructs. The SHAWI Dictionary, scheduled for a beta release in late 2025, is the first VICAV dictionary encoded entirely in TEI Lex-0, a refinement of the TEI Dictionary Module developed by the DARIAH Working Group on Lexical Resources, which aims at harmonizing the representation of lexical data and facilitating interoperability across projects (Tasovac et al. 2018ff.).
The adoption of TEI Lex-0 allows for both greater formal consistency and project-specific adaptability. The SHAWI Dictionary extends Lex-0 through the TEI mechanism of ODD chaining (Rahtz 2014), producing a VICAV-wide generic dictionary schema that forms a common backbone for future resources. The SHAWI Dictionary’s project-specific adaptation of this schema introduces several innovations:
(1) Encoding structures for diatopic and sociocultural variation: The element <usg type="geographic"> serves as a wrapper for embedded <name> elements denoting places and tribes alike; these are in turn linked to entities in local reference resources established in the WIBARAB project (What is Bedouin-Type Arabic?, 2021–2026; ERC 101020127-WIBARAB).
(2) Refined bibliographic integration: While TEI Lex-0 (and TEI P5) support citation of sources at the dictionary level, this is too coarse-grained for the needs of the SHAWI Dictionary. To address this, <entry> elements in the SHAWI customization may include a <listBibl> element containing placeholders for records from the VICAV bibliography. This allows context-specific bibliographic details (such as page numbers or comments) to be added while avoiding duplication of bibliographic information.
(3) Extended encoding of features specific to Arabic varieties: So far, the TEI Lex-0 specification offers no dedicated mechanism for representing morphological structures characteristic of Semitic languages. The SHAWI customization therefore introduces new attribute values for @type on <gram> to capture phenomena such as root-based derivation, morphological patterns, and verbal stem classes.
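
To make the three extensions more concrete, the following is a minimal, purely illustrative sketch of how a single entry might combine them. It is not taken from the SHAWI data or schema: the identifiers, the language tag, the pointer targets, the lemma, and in particular the @type values on <name> and <gram> ("place", "tribe", "root", "pattern", "stem") as well as the use of @corresp on <bibl> are assumptions made for this example. Only the elements <usg type="geographic">, <name>, <listBibl> inside <entry>, and @type on <gram> are named in the description above.

```xml
<!-- Hypothetical entry; all identifiers and values are placeholders. -->
<entry xml:id="example_entry_001" xml:lang="ar-x-example">
  <form type="lemma">
    <orth>kitab</orth>
  </form>
  <gramGrp>
    <gram type="pos">verb</gram>
    <!-- (3) Assumed new @type values for Semitic morphology -->
    <gram type="root">ktb</gram>
    <gram type="pattern">CiCaC</gram>
    <gram type="stem">I</gram>
  </gramGrp>
  <sense xml:id="example_entry_001_s1">
    <!-- (1) Geographic usage wrapping names of places and tribes,
         each pointing to an entity in a reference resource -->
    <usg type="geographic">
      <name type="place" ref="#example_place_id">Urfa</name>
      <name type="tribe" ref="#example_tribe_id">TRIBE_NAME</name>
    </usg>
    <cit type="translationEquivalent" xml:lang="en">
      <quote>to write</quote>
    </cit>
  </sense>
  <!-- (2) Entry-level bibliography: a pointer to a bibliography record
       plus context-specific details such as a page range or a comment -->
  <listBibl>
    <bibl corresp="#example_bibl_id">
      <citedRange unit="page">PAGE_RANGE</citedRange>
      <note>Placeholder for a context-specific comment.</note>
    </bibl>
  </listBibl>
</entry>
```

In a sketch of this kind, the full bibliographic record lives only once in the shared bibliography, while the entry carries just the pointer and the locally relevant details; the place and tribe names likewise resolve to shared reference entities rather than being described inline.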
By applying the TEI Lex-0 schema to a dialectological context, the SHAWI Dictionary demonstrates the adaptability of community standards to non-Indo-European linguistic data. It contributes both to the ongoing consolidation of digital lexicographic practices and to the sustainable documentation of previously underdescribed Arabic varieties, offering an example of how TEI-based infrastructures can bridge linguistic research, digital humanities, and language technology.