8–12 Oct 2024
Hotel Croatia
Europe/Warsaw timezone

Innovation in Phraseomatics: DiCoP Project and DiCoP-Text Corpus for the Enrichment of Language Models and Automatic Translation

9 Oct 2024, 17:00
1h 30m
Tihi salon (Hotel Croatia)

Speakers

Lian Chen, Wenjun Sun, Flora Badin

Description

This article examines advances in phraseomatics and digital phraseography through the DiCoP project and its DiCoP-Text corpus, which aim to enrich language models and machine translation. The project evaluates how frequently phraseological units (PUs) are used and improves their translation across contexts, drawing on recent research in phraseotranslation and natural language processing (NLP). It focuses on the French-Chinese and Chinese-French language pairs. For our tests, we integrated 549 PUs from the novel The Three-Body Problem by Liu Cixin. Several processing steps (tokenization, identification, alignment, and annotation) were applied to improve the translation of PUs. DiCoP-Text, a comprehensive database that includes newspaper articles, literary works, and textbooks, aims to enhance the performance of language models (LMs).
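As a rough illustration of the identification and frequency-evaluation steps mentioned above, the following minimal sketch counts how often each PU from a list occurs in a text. It is not the DiCoP implementation: the function name `count_pus`, the greedy longest-match strategy, and the two sample idioms and sentence are assumptions chosen for illustration, and none of them is taken from the project or from the 549 PUs used in the tests.

```python
from collections import Counter


def count_pus(text, pus):
    """Count non-overlapping occurrences of each PU in text.

    Hypothetical helper: plain string matching with greedy longest match,
    which works for unsegmented Chinese text as well as for French.
    """
    # Try longer units first so a short PU nested inside a longer one
    # is not counted twice at the same position.
    ordered = sorted(set(pus), key=len, reverse=True)
    counts = Counter()
    i = 0
    while i < len(text):
        for pu in ordered:
            if pu and text.startswith(pu, i):
                counts[pu] += 1
                i += len(pu)  # skip past the matched unit
                break
        else:
            i += 1  # no PU starts here, move one character forward
    return counts


# Illustrative data only (assumed, not from the DiCoP-Text corpus).
sample_pus = ["画蛇添足", "一帆风顺"]
sample_text = "祝你一帆风顺。不要画蛇添足，一帆风顺就好。"
frequencies = count_pus(sample_text, sample_pus)
```

A real pipeline would likely add proper tokenization, handling of inflected or discontinuous PUs, and alignment with the translated text; this sketch covers only exact, contiguous matches.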

Co-authors
