How we crawl

We build our dataset from public Telegram channels using a snowball approach. Starting from a set of seed channels, we expand the dataset whenever forwarded messages reveal additional public channels that meet our inclusion criteria. Newly included channels are then collected across the defined time period and become part of the ongoing discovery process.*

A key challenge is to identify as many relevant channels as possible without letting the snowball grow so broad that the dataset becomes less focused and the collection process too resource-intensive. For this reason, we apply clear inclusion criteria based on language and channel size. We focus on European languages only and use language-specific minimum follower thresholds to keep the dataset balanced across languages.
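The discovery loop described above can be sketched as a breadth-first expansion over forward links. This is a minimal illustration, not our production crawler: `forwarded_from` and `meets_criteria` are hypothetical callables standing in for the message collection and the language/size check, and `max_channels` is an assumed safety cap that the text does not specify.

```python
from collections import deque
from typing import Callable, Iterable


def snowball(
    seeds: Iterable[str],
    forwarded_from: Callable[[str], Iterable[str]],
    meets_criteria: Callable[[str], bool],
    max_channels: int = 10_000,  # assumed cap, not part of the described method
) -> list[str]:
    """Return included channels in discovery order (seeds first)."""
    included: list[str] = []
    seen: set[str] = set()
    queue: deque[str] = deque()

    # Seeds are included unconditionally.
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    while queue and len(included) < max_channels:
        channel = queue.popleft()
        included.append(channel)
        # Every channel this one forwarded from is a discovery candidate.
        for source in forwarded_from(channel):
            if source in seen:
                continue
            seen.add(source)  # evaluate each candidate only once
            if meets_criteria(source):
                queue.append(source)

    return included
```

Marking rejected candidates as seen keeps the loop from re-evaluating the same channel each time it is forwarded, which is one simple way to limit resource use.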

These thresholds are derived from a reference dataset created from an earlier crawl. For most languages, the threshold is set at the 75th percentile of channel size (follower count), rounded to the nearest 500, with a general minimum of 1,000 followers. Russian is treated as an exception and requires at least 50,000 followers, because otherwise the scale of Russian-language Telegram usage would outweigh other languages in the dataset. The same 50,000-follower threshold also applies to channels classified as multilingual.
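As a rough illustration of the threshold rule, the sketch below computes a per-language minimum from a list of follower counts. Several details are assumptions rather than part of the description above: the language codes (`"ru"`, `"multilingual"`), the percentile interpolation method (`"inclusive"`), and rounding halves upward.

```python
import statistics

GENERAL_MINIMUM = 1_000
# Fixed thresholds stated in the text; the keys are assumed labels.
FIXED_THRESHOLDS = {"ru": 50_000, "multilingual": 50_000}


def round_to_nearest(value: float, step: int = 500) -> int:
    """Round to the nearest multiple of `step` (halves round up)."""
    return int(value / step + 0.5) * step


def follower_threshold(language: str, follower_counts: list[int]) -> int:
    """Minimum follower count for a language, from reference-crawl data."""
    if language in FIXED_THRESHOLDS:
        return FIXED_THRESHOLDS[language]
    # quantiles(n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    # Note: statistics.quantiles needs at least two data points on most Python versions.
    q75 = statistics.quantiles(follower_counts, n=4, method="inclusive")[2]
    return max(GENERAL_MINIMUM, round_to_nearest(q75))
```

For example, follower counts of 200, 400, 800, 1,600 and 3,200 give a 75th percentile of 1,600 under this method, which rounds to a threshold of 1,500.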

Channel language is determined with our separate language-classification method (see below).

Each language is seeded with 20 randomly selected channels from the top quartile of channels in that language. For some languages, however, the reference dataset contains only a limited number of eligible channels, which can constrain seed selection.
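Seed selection could look roughly like the following sketch. It assumes that "top quartile" refers to follower count in the reference dataset and that the cutoff uses the same percentile method as the threshold calculation; both are our illustration's assumptions, and `select_seeds` is a hypothetical helper name.

```python
import random
import statistics
from typing import Optional


def select_seeds(
    followers_by_channel: dict[str, int],
    n_seeds: int = 20,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Randomly pick up to `n_seeds` channels from the top quartile by followers."""
    rng = rng or random.Random()
    counts = list(followers_by_channel.values())
    if len(counts) < 2:
        # Too few channels to compute quartiles: use whatever is available.
        return sorted(followers_by_channel)

    cutoff = statistics.quantiles(counts, n=4, method="inclusive")[2]
    top_quartile = sorted(
        ch for ch, followers in followers_by_channel.items() if followers >= cutoff
    )
    # Languages with few eligible channels yield fewer than n_seeds seeds.
    k = min(n_seeds, len(top_quartile))
    return rng.sample(top_quartile, k)
```

Passing a seeded `random.Random` instance makes the selection reproducible.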

* The current collection period starts on 1 January 2025. Depending on operational capacity, the dataset may later be maintained on a rolling basis, with only the most recent two years of data retained and older messages removed.

How we classify channel language

We classify channel language in two steps. First, individual messages are preprocessed to remove typical Telegram noise such as URLs, handles, boilerplate phrases, view counters, and other non-linguistic elements. We then assess whether the remaining text contains enough linguistic substance to support reliable classification. Messages that are too short or consist mainly of emojis, symbols, links, or other low-information content are excluded at this stage. Only eligible messages are then classified with fastText (lid.176.bin).
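The cleaning and eligibility check can be sketched as follows. The regular expressions, the boilerplate phrase list, and the two cutoffs (`MIN_LETTERS`, `MIN_LETTER_SHARE`) are illustrative assumptions; the actual patterns and limits are not specified above.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+|t\.me/\S+")
HANDLE_RE = re.compile(r"@\w+")
# Hypothetical view-counter pattern, e.g. "1.2K views".
VIEWS_RE = re.compile(r"\b\d+(?:[.,]\d+)?[KkMm]?\s+views\b")
WHITESPACE_RE = re.compile(r"\s+")
# Hypothetical boilerplate phrases; a real list would be language-specific.
BOILERPLATE = ("Subscribe to our channel", "Join the chat")

MIN_LETTERS = 20        # assumed minimum number of alphabetic characters
MIN_LETTER_SHARE = 0.5  # assumed minimum share of letters among non-space characters


def clean_message(text: str) -> str:
    """Strip URLs, handles, view counters and boilerplate; collapse whitespace."""
    for pattern in (URL_RE, HANDLE_RE, VIEWS_RE):
        text = pattern.sub(" ", text)
    for phrase in BOILERPLATE:
        text = text.replace(phrase, " ")
    # Collapsing whitespace also removes newlines from the text.
    return WHITESPACE_RE.sub(" ", text).strip()


def is_eligible(cleaned: str) -> bool:
    """True if the cleaned text has enough letters to attempt classification."""
    letters = sum(ch.isalpha() for ch in cleaned)
    non_space = sum(not ch.isspace() for ch in cleaned)
    if non_space == 0:
        return False
    return letters >= MIN_LETTERS and letters / non_space >= MIN_LETTER_SHARE
```

Only messages for which `is_eligible(clean_message(text))` holds would then be passed on to the fastText model.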

In a second step, message-level results are aggregated at channel level. For this, we use a recent window of up to 40 messages per channel. Each message can contribute at most one language vote, based on its highest-confidence language label, and only if that label reaches a confidence of more than 0.30. This helps avoid overstating multilingualism when individual messages contain mixed or uncertain signals.
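The voting step might be sketched like this. The input format is an assumption: each message is represented by a list of `(label, confidence)` pairs from the classifier, ordered newest first, and the list is assumed to contain only messages that passed the eligibility check.

```python
from collections import Counter
from typing import Optional

WINDOW_SIZE = 40       # at most 40 recent messages per channel
MIN_CONFIDENCE = 0.30  # a vote requires confidence strictly above this value


def message_vote(prediction: list[tuple[str, float]]) -> Optional[str]:
    """Return the top label if it is confident enough, else None."""
    if not prediction:
        return None
    label, confidence = max(prediction, key=lambda pair: pair[1])
    return label if confidence > MIN_CONFIDENCE else None


def channel_votes(predictions_newest_first: list[list[tuple[str, float]]]) -> Counter:
    """Count one vote per message over the most recent window."""
    votes: Counter = Counter()
    for prediction in predictions_newest_first[:WINDOW_SIZE]:
        label = message_vote(prediction)
        if label is not None:
            votes[label] += 1
    return votes
```

Because only the highest-confidence label of a message can vote, a message with a weak secondary language signal never adds a second vote.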

The channel-level result is then assigned using a small set of aggregation rules. A channel is classified as a single language if one language clearly dominates the valid message sample. This applies, for example, if only one language receives votes and reaches at least 3 valid message votes, or if one language reaches at least 4 votes, accounts for at least 60% of all valid votes, and leads the second-ranked language by at least 2 votes.

A channel is classified as multilingual if at least two languages are substantially represented and no single language dominates. In practice, this means that at least two languages must each receive 3 or more votes and each account for at least 25% of valid votes, while the leading language remains below 60% of the total.

If neither pattern is met, or if no sufficiently reliable message votes are available, the channel receives no classification.
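The aggregation rules can be expressed compactly in code. The sketch below implements only the two single-language patterns given as examples above plus the multilingual pattern, so it may not cover every rule in use; the function name and the `"multilingual"` label are assumptions. Percentages are compared with integer arithmetic to avoid floating-point edge cases at exactly 60% or 25%.

```python
from collections import Counter
from typing import Optional


def classify_channel(votes: Counter) -> Optional[str]:
    """Return a language label, "multilingual", or None (no classification)."""
    ranked = [(lang, n) for lang, n in votes.most_common() if n > 0]
    total = sum(n for _, n in ranked)
    if total == 0:
        return None

    top_lang, top_votes = ranked[0]
    second_votes = ranked[1][1] if len(ranked) > 1 else 0
    top_has_60_percent = top_votes * 100 >= 60 * total

    # Single language: the only language with votes, and at least 3 votes.
    if len(ranked) == 1 and top_votes >= 3:
        return top_lang
    # Single language: at least 4 votes, at least 60% share, lead of at least 2.
    if top_votes >= 4 and top_has_60_percent and top_votes - second_votes >= 2:
        return top_lang

    # Multilingual: two or more languages with >= 3 votes and >= 25% share each,
    # while the leading language stays below 60%.
    substantial = [lang for lang, n in ranked if n >= 3 and n * 100 >= 25 * total]
    if len(substantial) >= 2 and not top_has_60_percent:
        return "multilingual"

    return None
```

For instance, 6 votes against 4 yields a single-language result (60% share, lead of 2), while 5 against 4 yields "multilingual", and 2 against 2 yields no classification.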