Language coverage

Which languages Anomalica supports, how they were chosen, and what coverage they provide.

Anomalica publishes in 30 languages covering approximately 80% of the world’s literate population. Languages are selected algorithmically by incremental coverage of literate people, using open data from the Unicode Common Locale Data Repository.

At each step, the language covering the most currently-uncovered literate people is chosen. The selection script, source data, and output are published in the project repository.
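
The greedy step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's selection script: the input shape, the function names, and the toy data are hypothetical, and it assumes that within a territory the readers of different languages overlap as much as possible, so a territory's covered share is the largest share among the languages already chosen. The published script may model overlap differently.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: territory -> (literate population,
# {language: share of that population able to read the language}).
Territories = Dict[str, Tuple[float, Dict[str, float]]]


def incremental_gain(lang: str, territories: Territories,
                     covered: Dict[str, float]) -> float:
    """Literate people newly reached by `lang`, given the covered share per territory."""
    return sum(
        pop * max(0.0, shares.get(lang, 0.0) - covered[name])
        for name, (pop, shares) in territories.items()
    )


def greedy_ranking(territories: Territories, k: int) -> List[Tuple[str, float]]:
    """Repeatedly pick the language that adds the most uncovered literate people."""
    covered = {name: 0.0 for name in territories}
    remaining = sorted({lang for _, shares in territories.values() for lang in shares})
    ranking: List[Tuple[str, float]] = []
    while remaining and len(ranking) < k:
        best = max(remaining, key=lambda lang: incremental_gain(lang, territories, covered))
        gain = incremental_gain(best, territories, covered)
        if gain <= 0:
            break  # nothing left to cover
        ranking.append((best, gain))
        remaining.remove(best)
        # Assumption: maximal overlap, so coverage in a territory is the largest share chosen so far.
        for name, (_, shares) in territories.items():
            covered[name] = max(covered[name], shares.get(best, 0.0))
    return ranking


# Toy data, invented for illustration only.
toy = {
    "A": (100.0, {"x": 0.9, "y": 0.4}),
    "B": (50.0, {"y": 0.8}),
    "C": (10.0, {"z": 1.0}),
}
for lang, gain in greedy_ranking(toy, k=3):
    print(lang, round(gain, 1))
```

In the toy data, language y reaches 80 people in total but is ranked on the 40 it adds once x is already selected, which is the difference between the total-speakers and incremental figures in the ranking.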

Cumulative coverage of literate world population

Teal bars show incremental literate population covered by each language. The copper line tracks cumulative coverage. Faded bars are languages not in the supported set.

Algorithmic ranking

Each language is ranked by how many additional literate people it covers beyond all higher-ranked languages.

#	Language	Total speakers	Incremental	Cumulative	Coverage

The full ranking of 60 languages and the selection script are available in the project repository.

Editorial adjustments

Three languages from the algorithmic top 30 are excluded:

Javanese (rank 21, 32M) - primarily spoken; most literate speakers read Indonesian, supported at position 8
Malay (rank 27, 17M) - mutually intelligible with Indonesian in written form
Nigerian Pidgin (rank 30, 14M) - primarily spoken, not a standard written language for reference works

One language is added:

Ukrainian (rank 33, 13M) - the platform does not support Russian without Ukrainian during an active conflict between the two countries

After these adjustments the set contains 28 translated languages (30 - 3 + 1), which produce 30 displayed languages. Traditional Chinese is a mechanical character conversion from Simplified Chinese, and American English is a spelling conversion from British English.

Translation quality

AI translation quality varies measurably by language. The WMT24 shared task (the standard academic benchmark for machine translation) found that large language models now outperform conventional translation systems across all 55 languages tested, but with a clear quality gradient.

WMT24 CometKiwi scores for English-to-target translation (higher is better, scale 0 to 1):

Language pair	Best LLM score	Quality
English to Japanese	0.762	Strong
English to Spanish	0.745	Strong
English to Russian	0.742	Strong
English to Ukrainian	0.732	Strong
English to Chinese	0.726	Strong
English to German	0.723	Strong
English to Hindi	0.657	Moderate

Hindi, the highest-resourced language in the Indic family, scores 0.066 to 0.105 lower than the European and East Asian languages in the table (0.657 against 0.723 to 0.762). Languages with less training data (Burmese, Uzbek, Marathi) can be expected to show a larger gap, though published benchmark data for these specific languages is limited.
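
The size of the Hindi gap can be checked directly from the table. The short sketch below copies the scores listed above and computes each language's difference from Hindi; the function and variable names are illustrative only.

```python
# CometKiwi scores copied from the table above (English-to-target).
scores = {
    "Japanese": 0.762,
    "Spanish": 0.745,
    "Russian": 0.742,
    "Ukrainian": 0.732,
    "Chinese": 0.726,
    "German": 0.723,
    "Hindi": 0.657,
}


def gaps_to(baseline: str, table: dict) -> dict:
    """Score difference of every other language relative to `baseline`."""
    base = table[baseline]
    return {lang: round(score - base, 3)
            for lang, score in table.items() if lang != baseline}


hindi_gaps = gaps_to("Hindi", scores)
smallest_gap = min(hindi_gaps.values())  # German vs Hindi
largest_gap = max(hindi_gaps.values())   # Japanese vs Hindi
print(smallest_gap, largest_gap)
```

The smallest gap comes from German and the largest from Japanese, so the range quoted for Hindi is bounded by those two rows.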

Based on available benchmarks, the supported languages fall into three quality tiers:

Strong: English, French, German, Spanish, Portuguese, Russian, Chinese, Japanese, Italian, Polish, Korean, Ukrainian
Moderate: Arabic, Hindi, Turkish, Vietnamese, Indonesian, Thai, Bengali, Urdu, Persian, Swahili, Tamil, Telugu, Tagalog, Marathi
Limited data: Burmese, Uzbek
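
As a consistency check, the tiers above can be written down as a small lookup table. This is a hypothetical sketch: the names `TIERS`, `DERIVED`, and `tier_of` are invented here, and treating a converted variant (Traditional Chinese, American English) as inheriting the tier of its source language is an assumption, not something the benchmarks state.

```python
# Tier membership copied from the lists above.
TIERS = {
    "Strong": [
        "English", "French", "German", "Spanish", "Portuguese", "Russian",
        "Chinese", "Japanese", "Italian", "Polish", "Korean", "Ukrainian",
    ],
    "Moderate": [
        "Arabic", "Hindi", "Turkish", "Vietnamese", "Indonesian", "Thai",
        "Bengali", "Urdu", "Persian", "Swahili", "Tamil", "Telugu",
        "Tagalog", "Marathi",
    ],
    "Limited data": ["Burmese", "Uzbek"],
}

# Displayed variants produced by conversion rather than translation.
# Assumption: a variant inherits the tier of the language it is converted from.
DERIVED = {"Traditional Chinese": "Chinese", "American English": "English"}


def tier_of(language: str) -> str:
    """Return the quality tier for a displayed language."""
    base = DERIVED.get(language, language)
    for tier, members in TIERS.items():
        if base in members:
            return tier
    raise KeyError(f"unsupported language: {language}")


translated = sum(len(members) for members in TIERS.values())
displayed = translated + len(DERIVED)
print(translated, displayed, tier_of("Traditional Chinese"))
```

The counts reproduce the 28 translated and 30 displayed languages described under the editorial adjustments.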

For the third tier, published translation benchmarks are sparse. Meta’s NLLB-200 model scores Burmese lowest of all 28 languages on the FLORES benchmark (chrF++ 30.9 vs French at 69.6), though larger models perform substantially better than NLLB.

Translation corrections can be submitted through the content repository. Corrections are extracted as durable directives that persist across future article regeneration.

Sources: WMT24 General MT Ranking, WMT24++ 55-language expansion, NLLB-200 FLORES metrics.