How AI pioneers teach ChatGPT & Co. their languages

AI tools, from ChatGPT to Google Translate, are useless to billions of people in the Global South who don't work in western languages. Researchers and startups from Africa and other parts of the world are changing that.

In this repository, you will find the methodology, data and code behind the story that came out of this analysis.

Story by Kira Schacht, additional interviews by Hanna Demissie.

Read the full article on DW.com: English | German

Dataset

The file Data.xlsx contains the data our analysis is based on. For details on its structure, please refer to the metadata sheet in the file.

Data sources

Common Crawl

The Common Crawl is an openly available dataset consisting of billions of web pages from the internet. It is an important data source for many AI models: For instance, about 60% of the examples used to train ChatGPT's version 3.5 come from this collection.

Starting in 2018, Common Crawl analyzed the language of each page in its dataset. An overview downloaded from their GitHub provides the 3-letter language code, number and share of pages for each language.

As of now, Common Crawl is able to identify 160 different languages and up to 3 languages per document. 51% of the total web pages in the dataset have an unknown language, mostly those scraped before 2018. Of the pages scraped since then, only around 2-3% have no specified language.

For this story, we analyzed all pages with known languages. Almost half of these are in English alone. Below are the top 10 languages in the Common Crawl:

language	pages	share
English	56,562,220,256	46%
Russian	9,075,940,671	7%
Chinese	7,304,517,478	6%
German	7,033,437,439	6%
Japanese	6,127,097,773	5%
French	5,752,349,660	5%
Spanish	5,428,973,977	4%
Italian	3,012,638,667	2%
Portuguese	2,520,457,387	2%
Dutch	2,272,085,034	2%

Dataset downloaded on 14th July 2023.
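
The share column can be reproduced from raw page counts by dropping the pages with an unknown language and dividing each remaining count by the new total. The following is a minimal sketch in Python, not the code used for the story. The function name, the "unknown" key and the counts are assumptions made up for illustration and are not taken from the dataset.

```python
def language_shares(pages_by_language, unknown_key="unknown"):
    """Return each language's share of all pages whose language is known."""
    # Drop the bucket of pages without an identified language.
    known = {lang: n for lang, n in pages_by_language.items() if lang != unknown_key}
    total = sum(known.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in known.items()}


# Made-up counts for illustration only (not the Common Crawl figures).
example_counts = {"eng": 460, "rus": 70, "amh": 1, "other": 469, "unknown": 1000}
shares = language_shares(example_counts)

# Print the three largest shares, tab-separated like the table above.
top = sorted(shares.items(), key=lambda item: item[1], reverse=True)
for lang, share in top[:3]:
    print(f"{lang}\t{share:.1%}")
```

Excluding the unknown bucket before computing the total matters: with it included, every share in this example would be halved.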

Ethnologue

Ethnologue is a global database of languages operated by the Christian NGO SIL International. It provides extensive data for over 7,000 languages worldwide.

We scraped the publicly available data from their database, which offers, among other things:

Language code and name
A rough categorization of how many native speakers a language has (>1B, 1M-1B, 10K-1M, <10K, None)
Vitality (institutional, stable, endangered, or extinct)
Level of digital language support
Language family
Main country and region

Please refer to the Ethnologue language pages (e.g. here) for more information about their categorizations.

We used the Ethnologue database to assign language names to the 3-letter language codes used in the Common Crawl dataset, as well as to get information about the digital support level, vitality and geography of languages with more than 1 million speakers.
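
Matching the two sources amounts to a lookup by 3-letter code. The following is a minimal sketch in Python, not the code used for the story. The function name, field names and all values are hypothetical placeholders, not actual Ethnologue or Common Crawl records.

```python
def attach_language_info(crawl_rows, ethnologue_by_code):
    """Add name and vitality to each Common Crawl row, matched by 3-letter code."""
    merged = []
    for row in crawl_rows:
        # Codes missing from the lookup get None so they can be reviewed by hand.
        info = ethnologue_by_code.get(row["code"], {})
        merged.append({
            **row,
            "name": info.get("name"),
            "vitality": info.get("vitality"),
        })
    return merged


# Placeholder records for illustration only.
crawl_rows = [
    {"code": "eng", "pages": 460},
    {"code": "amh", "pages": 1},
    {"code": "zzz", "pages": 2},  # hypothetical code with no match
]
ethnologue_by_code = {
    "eng": {"name": "English", "vitality": "placeholder"},
    "amh": {"name": "Amharic", "vitality": "placeholder"},
}

merged = attach_language_info(crawl_rows, ethnologue_by_code)
unmatched = [row["code"] for row in merged if row["name"] is None]
print(unmatched)
```

Keeping unmatched codes in the result, rather than dropping them, makes it easy to spot languages that need a manual lookup.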

Wikidata

To compare each language's representation in the Common Crawl with its speakers, we needed exact information on the number of native speakers for each language logged in the Common Crawl. Since official databases like Ethnologue only offer a rough category, we sourced the figures from Wikidata. The exact query and its output can be found here.

Where available, we manually picked the number of native speakers for each language from the query results. Languages that could not be found via Wikidata were supplemented manually, which is indicated in the column manual of the wikidata tab in the data file.
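
One possible way to set a language's web presence against its speaker base is to divide its share of pages by its share of native speakers. The following is a minimal sketch in Python under that assumption; it is not necessarily the metric used in the story, and the function name and all numbers are made up for illustration.

```python
def representation_ratio(page_share, speakers, total_speakers):
    """Ratio of a language's page share to its share of native speakers.

    Values above 1 mean the language has more pages than its speaker
    count alone would suggest; values below 1 mean it has fewer.
    """
    if speakers <= 0 or total_speakers <= 0:
        raise ValueError("speaker counts must be positive")
    speaker_share = speakers / total_speakers
    return page_share / speaker_share


# Made-up inputs: 46% of pages, 400 of 8,000 speakers (5%).
ratio = representation_ratio(page_share=0.46, speakers=400, total_speakers=8000)
print(f"{ratio:.1f}")
```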

Interview partners

Asmelash Teka Hadgu, founder of Lesan.AI
Mekdes Gebrewold, founder of Ashagari consultancy
