Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

From Word of the Year to Word of the Week: Daily-updated Monitor Corpora for 25 Languages

Nov 19, 2025, 5:30 PM
30m
Sonce hall


Speakers

Ondřej Herman, Miloš Jakubíček, Jan Kraus, Vít Suchomel

Description

This paper presents a long-term, privately funded programme for collecting timestamped monitor corpora in a wide range of languages (currently 25). These corpora are primarily designed for researching linguistic trends (including neology) and language change over time. They are available through the Sketch Engine platform and vary significantly in size, from 3 million tokens for Irish to over 100 billion tokens for English. The languages currently included are Arabic, Catalan, Chinese, Czech, Danish, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Irish, Maltese, Norwegian, Persian, Polish, Portuguese, Russian, Slovak, Slovene, Spanish, Tamil, and Ukrainian. New languages are continuously being added, with Afrikaans, Amharic, Armenian, Azerbaijani, Georgian, Igbo, Indonesian, Oromo, Urdu, Uzbek and Yoruba as the next candidates to be added in the coming months.

The corpora are constructed from news articles published on websites worldwide that offer content via newsfeeds (in RSS and Atom formats). Data coverage begins as early as 2014 for the oldest corpora and in 2023 for the most recently introduced languages. New data is collected daily, and an update for each trend corpus is published twice a week. The current work builds on the previously published JSI Newsfeed Corpus (Krek & Herman, 2017), which provided news content only until 2022. Since 2021 for English and 2023 for other languages, data collection has been carried out independently of the previous work, expanding the number of supported languages and incorporating new data sources. Sketch Engine already offers extra functionality for corpora with diachronic annotation. Our trend corpora support analysis on a daily, monthly, quarterly or yearly basis. Besides the dedicated Trend function in Sketch Engine (Kilgarriff, 2015), such metadata can be used to refine a lexicographer’s analysis in a concordance search, in wordlist discovery, or in the collocational behavior of words provided by the Word Sketch feature.
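As a rough illustration of what analysis at daily, monthly, quarterly or yearly granularity involves, the following Python sketch groups timestamped documents into periods and computes a word's relative frequency per million tokens. The function names and the in-memory (date, tokens) representation are assumptions made for this example; it does not describe how Sketch Engine stores or queries its corpora.

```python
from collections import Counter
from datetime import date


def period_key(d: date, granularity: str) -> str:
    """Map a publication date to a label for the chosen granularity."""
    if granularity == "day":
        return d.isoformat()
    if granularity == "month":
        return f"{d.year}-{d.month:02d}"
    if granularity == "quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if granularity == "year":
        return str(d.year)
    raise ValueError(f"unknown granularity: {granularity}")


def relative_frequency(docs, word, granularity):
    """Return {period: hits per million tokens} for `word`.

    `docs` is an iterable of (date, list_of_tokens) pairs (hypothetical
    format). Matching is case-insensitive on the raw token.
    """
    hits, totals = Counter(), Counter()
    target = word.lower()
    for pub_date, tokens in docs:
        key = period_key(pub_date, granularity)
        totals[key] += len(tokens)
        hits[key] += sum(1 for t in tokens if t.lower() == target)
    return {
        key: hits[key] / totals[key] * 1_000_000
        for key in sorted(totals)
        if totals[key] > 0
    }
```

Normalising by the token count of each period matters here because the daily volume of collected text varies, so raw counts from different periods are not directly comparable.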

Nearly 30,000 newsfeeds are queried six times a day, yielding up to 180,000 new articles per day on weekdays and more than 110,000 per day on weekends. The publication date is extracted from the information supplied by the feed, so that timestamps are as accurate as possible. The processing pipeline applies several web text cleaning procedures, namely main text body extraction and removal of near-duplicates, and enriches the data with linguistic annotations, following methodologies similar to those used for the JSI Newsfeed Corpus and the TenTen corpora family (Kilgarriff, 2014).
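To make the timestamping step concrete, here is a minimal Python sketch of reading publication dates from feed XML using only the standard library. It assumes RSS 2.0 items carry a pubDate element and Atom entries carry published (or, failing that, updated); the function name and the choice to treat timezone-less RSS dates as UTC are assumptions of this example, not details of the pipeline described above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree notation


def entry_timestamps(feed_xml: str):
    """Return a list of (title, UTC datetime) for entries in an RSS or Atom feed."""
    root = ET.fromstring(feed_xml)
    results = []

    # RSS 2.0: <item><pubDate> holds an RFC 822 style date.
    for item in root.iter("item"):
        raw = item.findtext("pubDate")
        if raw:
            dt = parsedate_to_datetime(raw.strip())
            if dt.tzinfo is None:
                # Assumption: treat dates without a usable zone as UTC.
                dt = dt.replace(tzinfo=timezone.utc)
            results.append((item.findtext("title", default=""),
                            dt.astimezone(timezone.utc)))

    # Atom: <entry><published> (or <updated>) holds an RFC 3339 timestamp.
    for entry in root.iter(ATOM + "entry"):
        raw = entry.findtext(ATOM + "published") or entry.findtext(ATOM + "updated")
        if raw:
            # Older Python versions do not accept a trailing "Z" in fromisoformat.
            dt = datetime.fromisoformat(raw.strip().replace("Z", "+00:00"))
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=timezone.utc)
            results.append((entry.findtext(ATOM + "title", default=""),
                            dt.astimezone(timezone.utc)))
    return results
```

Converting everything to UTC gives a single comparable timeline across feeds published in different time zones. Real feeds often deviate from the specifications, so a production parser would need more tolerant date handling than this sketch provides.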

In addition to corpus construction, the paper details statistics on feed activity, including download volumes and the decay rate (how long a newsfeed typically remains functional), as well as the most represented websites per language. The paper also showcases examples of functionality offered by Trend corpora that support corpus lexicography and linguistic research, including neologism detection, word sense shift analysis, and timeline-based analysis of trending words and phrases.
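One simple way to rank words by how strongly their usage is rising over a timeline is to estimate a slope for each word's per-period relative frequencies. The Python sketch below uses the Theil-Sen estimator (the median of all pairwise slopes), chosen here because it is robust to single-period spikes. This is an illustrative choice with hypothetical function names; it is not presented as the method behind the Trend function.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(freqs):
    """Median of pairwise slopes over equally spaced periods 0, 1, 2, ..."""
    slopes = [
        (freqs[j] - freqs[i]) / (j - i)
        for i, j in combinations(range(len(freqs)), 2)
    ]
    if not slopes:
        raise ValueError("need at least two periods to estimate a slope")
    return median(slopes)


def rank_trending(series):
    """Sort words from most strongly rising to most strongly falling.

    `series` maps each word to its list of relative frequencies,
    one value per consecutive period (hypothetical input format).
    """
    return sorted(series, key=lambda w: theil_sen_slope(series[w]), reverse=True)
```

For example, given a steadily used word, a rising word, and a declining word, rank_trending places the rising word first and the declining word last. A practical system would also test whether each slope is statistically significant before reporting a word as trending.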
