8–12 Oct 2024
Hotel Croatia
Europe/Warsaw timezone

Annotation of Non-Lexical Entities in Croatian Health Forum Entries With Large Language Models

9 Oct 2024, 17:00
1h 30m
Tihi salon (Hotel Croatia)

Speakers

Amila Kugic, Markus Kreuzthaler, Stefan Schulz

Description

Introduction
Technical languages contain expressions that are not universally understood. We call these non-lexical entities (NLEs), i.e., single- or multi-word expressions not listed in domain dictionaries. NLEs are especially difficult to distinguish from lexical entities when domain dictionaries are small or incomplete, which is often the case for low-resource languages. The medical domain further complicates this issue through specialized written language, jargon expressions, the use of multiple languages, and various entities absent from domain lexica. This work focuses on four NLE categories in the Croatian medical domain: (i) short forms, i.e., abbreviations and acronyms; (ii) deviations from standard spelling, i.e., lexical variants, misspellings, and mistypings; (iii) brand names; and (iv) proper names. The choice of categories was influenced by related work in various languages underlining the challenges of short-form ambiguity (Schwarz et al., 2021), abbreviation structure and conformity in the Austrian Electronic Health Record (ELGA), and information retrieval tasks (Gendrin et al., 2023; Raja, 2022). Large language models (LLMs) offer the opportunity to exploit the contextual understanding of trained language models to identify NLEs and annotate these portions of text without following a named entity recognition (NER) approach, instead prompting for the solution with tools such as ChatGPT, an LLM for text generation and synthesis.

Dataset
The dataset consists of health forum entries in Croatian, crawled from online websites and annotated by a domain expert. These texts blend typical lay language with pasted fragments of clinical documents, with the aim of receiving physician-authored advice or answers. NLEs make processing such texts harder, as most smaller language models are trained on text data using standardized language from domain dictionaries. Furthermore, NLEs introduce ambiguity, and extracting the context is necessary to distinguish dictionary content from NLEs: e.g., OPIS (‘description’) in uppercase letters could be either a section header or an abbreviation, and the difference can only be resolved from context. The dataset is split into an 80% training, 10% validation, and 10% test set. The four mentioned categories were annotated and exported in a BIO (beginning-inside-outside) labeling format for sequence modeling. Uncertainty expressed by forum users further motivated the choice of NLE categories.
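As a minimal sketch of the BIO export described above: each token receives a B- (beginning), I- (inside), or O (outside) label. The tag names (SHORTFORM, VARIANT, BRAND) and the example sentence are hypothetical stand-ins; the paper does not specify its label strings.

```python
# Hypothetical BIO-labeled Croatian forum sentence; "OPIS" is treated here as
# a short form and "Andol" as a brand name, purely for illustration.
tokens = ["OPIS", ":", "pacijentica", "uzima", "Andol", "zbog", "boli"]
labels = ["B-SHORTFORM", "O", "O", "O", "B-BRAND", "O", "O"]

# Pair tokens with labels, the usual input shape for sequence models.
bio_pairs = list(zip(tokens, labels))
```

Multi-word NLEs would continue with I- tags (e.g., B-BRAND followed by I-BRAND).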

Methodology
For a comparative baseline without LLMs, an NER approach with fine-tuned BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020) models was performed. To accomplish the automatic identification and annotation of NLEs, ChatGPT was prompted for each run-through with the prompt shown in Figure 1. Fine-tuning, in the context of LLMs, refers to supplying training examples as prompt-answer pairs to create a specialized downstream version of the given LLM, which is then used for prompting. First, the model ‘gpt-3.5-turbo’ was applied without fine-tuning; the training dataset was used for manual prompt engineering. Second, the model ‘gpt-3.5-turbo-1106’ was fine-tuned twice with differently sized subsets of the training dataset. The investigation focused on the impact of fine-tuning on performance while limiting the number of training samples for cost reasons.
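The fine-tuning setup above can be sketched as follows, assuming the chat-format JSONL training files used by OpenAI's gpt-3.5-turbo fine-tuning API. The prompt wording and the inline annotation scheme are placeholders, not the paper's actual prompt from Figure 1.

```python
import json

# Hypothetical single training example in OpenAI chat fine-tuning format.
# System/user/assistant contents are illustrative stand-ins only.
example = {
    "messages": [
        {"role": "system",
         "content": "Annotate non-lexical entities in Croatian forum text."},
        {"role": "user",
         "content": "Uzimam Andol vec 2 tjedna."},
        {"role": "assistant",
         "content": "Uzimam [BRAND Andol] vec 2 tjedna."},
    ]
}

# Each training sentence becomes one JSON line in the uploaded file.
jsonl_line = json.dumps(example, ensure_ascii=False)
```

A training file would contain one such line per annotated sentence (100 or 1,000 in the two fine-tuning runs described above).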
All methods were evaluated with precision, recall and F1-measure for exact prediction matches.
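The exact-match evaluation can be sketched as set intersection over predicted and gold spans; this is an assumed implementation of the stated metric, with spans represented as hypothetical (start, end, category) tuples.

```python
def exact_match_prf(gold, pred):
    """Precision, recall, and F1 over exact span matches: a prediction
    counts as correct only if it matches a gold span exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # true positives: exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, one exact match out of two gold spans and one prediction gives precision 1.0, recall 0.5, and F1 ≈ 0.67.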

Results
The baseline approaches reached exact prediction of NLEs with an F1-measure of 0.88 (BERT) and 0.91 (ELECTRA). Prompting for the annotation of Croatian health forum entries without fine-tuning yielded poor results, i.e., an F1-measure of 0.48, with most NLEs left unidentified. Through fine-tuning, performance increased to an F1-measure of 0.82, even with only 100 sentences (approx. 98,000 tokens). In the final fine-tuning step, with 1,000 sentences of the training set, less than one third of the full training dataset, F1-measures similar to the multilingual BERT and specialized ELECTRA baselines were reached (see Table 2), while surpassing the baselines in precision.

Discussion
The best model outperformed both baselines in precision and the multilingual BERT baseline in F1-measure, but still fell slightly short of the specialized ELECTRA model. This suggests that LLMs can be fine-tuned with only a third of the data needed by the baseline methods to reach state-of-the-art results. The largest portion of errors with the non-fine-tuned language model stemmed from insufficient contextual understanding of the prompt, leading to misclassification of NLEs as lexical entities. Fine-tuning significantly reduced these errors, particularly in differentiating lexical variants. A detailed error analysis revealed that misspellings, mistypings, and diacritic variations posed challenges, possibly due to the composition of LLM pre-training datasets, which are primarily in English. Lexical variants emerged as the most error-prone NLE category, followed by short forms, while brand names and proper names surpassed the baseline models. Given the complexity of medical terminology and the limited resources for Croatian medical texts, this approach is relevant for the identification, classification, and inclusion of NLEs into domain dictionaries, as well as for automating language resource creation.

Conclusion and Outlook
Our findings reveal ChatGPT’s potential for automating labeling in the Croatian medical domain, reaching results similar to state-of-the-art approaches with less data. Automated annotation can enhance datasets for low-resource languages, accelerate the creation and expansion of annotated datasets and dictionaries, and reduce human annotation hours. Future work will involve extending this workflow, optimizing fine-tuning, and employing natural language processing techniques to further process the identified NLEs.
