Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

Modeling and structuring of a bilingual French-Chinese phraseological dictionary: neural automatic approach for ontology and lexicography

Nov 18, 2025, 12:30 PM
30m
Zrak hall

Speaker

Lian Chen

Description

ONLINE PRESENTATION

The creation of ontologies—traditionally the domain of linguists and knowledge engineers—is undergoing a significant transformation thanks to advances in artificial intelligence and natural language processing (NLP). These developments open new avenues for phraseology, a field where multi-word expressions (MWEs)—often opaque and non-compositional—must be identified, classified, and linked to abstract concepts or discourse contexts (Constant 2012: 6). Despite their linguistic richness, idiomatic expressions remain a major challenge for NLP due to their syntactic variability, semantic ambiguity, and context-dependence (Gross 1996; Mejri 1997; Polguère 2002; Chen 2021).

This study presents an approach for modeling a bilingual French–Chinese phraseological dictionary by combining lexicographic theory, ontology design, and neural NLP techniques. We focus specifically on idiomatic expressions related to the human body and animals, domains in which words such as main (hand) can carry both literal and figurative meanings—e.g., as symbols of work, strength, or authority (Rey & Chantreau 2003; Rey 2019).

To overcome the limitations of manual ontology construction tools like Protégé (Kapoor & Sharma 2010), we follow the principles of the Ontology Layer Cake (Despres & Szulman 2008; Tiwari & Jain 2014) and implement a semi-automatic pipeline. Our methodology includes: (1) statistical extraction of idioms using TF-IDF, PMI, and RAKE; (2) syntactic filtering of candidate MWEs; (3) visualization and annotation through an interactive Streamlit interface; (4) semantic relation modeling using fine-tuned neural models (BilBERT and Sentence-BERT); and (5) export in OWL/RDF format using the OntoLex-Lemon standard, with SKOS for conceptual hierarchies and VarTrans for bilingual alignments.
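As an illustration of the statistical scoring in step (1), the following minimal sketch ranks adjacent word pairs by pointwise mutual information (PMI). It is not the authors' implementation: the function name `bigram_pmi`, the `min_count` threshold, whitespace tokenization, and the toy sentences are all assumptions made for this example, and a real pipeline would work on longer n-grams and a proper tokenizer.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Rank adjacent word pairs by PMI (hypothetical helper, not from the source)."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization (assumption)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))

    # Highest PMI first: pairs that co-occur more often than chance predicts.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy sentences invented for this sketch; they are not corpus data from the study.
toy_corpus = [
    "il va donner un coup de main",
    "elle veut donner un coup de main",
    "un coup de main suffit",
    "il a un chat",
]

for pair, score in bigram_pmi(toy_corpus):
    print(pair, round(score, 2))
```

On this toy input, pairs inside the idiom such as ("coup", "de") score higher than pairs involving the frequent word "un", which is the behaviour that makes PMI useful for proposing MWE candidates before syntactic filtering.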

A central challenge lies in extracting semantic triplets of the form (idiom, keyword, relation)—e.g., donner un coup de main → (main, aide)—which requires addressing the idioms’ non-compositionality, structural variation, and semantic opacity. We rely on syntactic grammars (Tesnière 1959), semantic mapping, and machine learning to formalize these triplets into interpretable ontological structures (Chen & Gasparini 2025).
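The triplet form (idiom, keyword, relation) can be pictured as a small record type. The sketch below is only one possible representation, not the data model used in the project: the class `IdiomTriplet` and the helper `group_by_keyword` are hypothetical names, and the second triplet is an illustrative pairing added for this example rather than a result from the study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IdiomTriplet:
    """One (idiom, keyword, relation) triplet; hypothetical structure."""
    idiom: str
    keyword: str
    relation: str

    def keyword_in_idiom(self):
        # Sanity check: the keyword should be one of the idiom's surface tokens.
        return self.keyword in self.idiom.split()


def group_by_keyword(triplets):
    """Collect triplets under their keyword, e.g. all senses carried by 'main'."""
    grouped = defaultdict(list)
    for triplet in triplets:
        grouped[triplet.keyword].append(triplet)
    return dict(grouped)


triplets = [
    # Example given in the abstract.
    IdiomTriplet("donner un coup de main", "main", "aide"),
    # Illustrative addition for this sketch only (assumption, not from the source).
    IdiomTriplet("avoir la haute main sur", "main", "autorité"),
]

for keyword, items in group_by_keyword(triplets).items():
    print(keyword, [(t.idiom, t.relation) for t in items])
```

Grouping by keyword makes the non-compositionality visible: the same body-part word maps to different relations depending on the idiom that contains it.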

The resulting resource is a multilingual, interoperable, and dynamic dictionary of idiomatic expressions, accessible via an interface that supports exploration, sorting, and export to Protégé or SPARQL-compatible systems. This work bridges NLP and lexicography, contributing to AI-enhanced auto-lexicography, semantic modeling, and the generation of context-aware bilingual examples (González-Rey 2002; Mel’čuk 2008, 2011; Mejri 2011; Sułkowska 2016; Chen 2023).

Our project aims to achieve six interconnected objectives. First, we design a semi-automatic pipeline for extracting and identifying idiomatic expressions from authentic French corpora, with a particular focus on thematic categories such as the human body and animals. Second, we construct semantic triplets that link idioms to keywords and conceptual categories, enabling fine-grained semantic interpretation. Third, we fine-tune a multilingual BERT-based model (BilBERT) to classify the semantic relations between idioms and their components. Fourth, we formally model the extracted data as an ontology using the OntoLex-Lemon framework, enriched with SKOS hierarchies and VarTrans modules to support bilingual alignment with Chinese equivalents. Fifth, we develop an interactive Streamlit interface that allows users to visualize idiomatic relationships, perform manual annotations, and export the data in RDF/OWL format. Finally, our project contributes to ongoing research in multilingual phraseology and AI-assisted lexicography, offering practical tools and resources for Semantic Web applications and advanced NLP tasks.

Here are several illustrations of the results obtained throughout the project, including visualizations of idiomatic triplets, conceptual mappings, and semantic graphs generated during the modeling and classification phases.
