Description
Corpus-based conceptual analysis for the Humanitarian Encyclopedia (HE) grapples with vast amounts of lexical data to describe the meaning of key humanitarian notions and to detect conceptual variation among actors (Odlum & Chambó, 2022). Building on Frame-based Terminology (Faber, 2015, 2022), the HE is incorporating the qualitative methods needed to distill lexical data into manageable semantic triples while ensuring the traceability and transparency of modeling decisions.
While traditional inductive qualitative analysis is labor-intensive, researchers are now replicating these methods with LLM-assisted workflows. Following this trend, our paper presents an observational study of a dataset of 274 spans labeled as causes of forced displacement, manually annotated on a random sample of 1,000 concordances drawn from an English corpus of humanitarian documents from ReliefWeb (Isaacs et al., 2024). In this initial assessment, we test LLM inductive categorization with four locally run models: Magistral Small 1.0 (Mistral-AI et al., 2025), with 24 billion parameters, and three DeepSeek R1 models (DeepSeek-AI, 2025), with 8, 32 and 70 billion parameters. They are evaluated against a manual categorization comprising 34 causality groupings produced by two annotators through consensus.
To assess baseline similarities, we give the models minimal, zero-shot instructions while requiring structured outputs, and we conduct 40 runs per model (10 runs per text format: lines, CSV rows, JSON dictionary and Python list). We evaluate model fitness by measuring (1) the degree of task completion, (2) the similarity of category assignments to the gold standard and (3) the semantic overlap of LLM-generated category labels with those in the gold standard. For category assignment similarity, we convert multiple Jaccard similarity scores into a single normalized measure. Category labels from the top ten runs (those with the highest category assignment similarity) showed semantic overlap with the manual labels. The results were nevertheless mixed: some LLM-generated labels were invalid, whereas others, although absent from the gold standard, were judged pertinent by the annotators.
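One way to combine multiple Jaccard scores into a single normalized measure is to average per-span Jaccard similarities between gold and model-assigned category labels. The sketch below illustrates that idea only; the span identifiers, category labels, and the averaging choice are hypothetical assumptions for illustration, not the study's actual data or exact procedure (omitted spans are scored 0 here, reflecting the task-completion issue mentioned above).

```python
# Illustrative sketch: averaging per-span Jaccard similarities into one
# normalized score in [0, 1]. Span IDs and labels below are invented.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two label sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def normalized_similarity(gold: dict, predicted: dict) -> float:
    """Mean per-span Jaccard over all gold spans.

    Spans the model omitted contribute 0, penalizing incomplete runs.
    """
    scores = [jaccard(gold[s], predicted.get(s, set())) for s in gold]
    return sum(scores) / len(scores)

gold = {"span1": {"conflict"}, "span2": {"drought", "famine"}}
pred = {"span1": {"conflict"}, "span2": {"drought"}}
print(round(normalized_similarity(gold, pred), 2))  # 0.75
```

Averaging over gold spans (rather than predicted ones) keeps the measure comparable across runs that classify different numbers of spans.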
In conclusion, the models displayed low overall similarity scores when given little instruction and hundreds of spans to classify in a single batch, and they consistently omitted spans despite being prompted not to. Outlier runs, however, achieved similarity scores comparable to those of the annotators while surfacing useful insights not captured in the manual categorization. These results underscore the complexity of categorizing data for a single, domain-specific concept, but they also highlight the potential of LLMs as complementary tools for qualitative analysis within the HE's conceptual analysis workflow. Future work will investigate multi-category tasks, hybrid human-in-the-loop approaches, refined prompting strategies, and additional pre- and post-processing of lexical data.