embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

Add a Benchmark for African languages

KennethEnevoldsen opened this issue

Linguistic Families and Proposed Languages:

Afro-Asiatic Languages

  • Arabic - ara (widely spoken in North Africa)
  • Amharic - amh (Ethiopia)
  • Somali - som (Somalia)
  • Tigrinya - tir (Eritrea and Ethiopia)
  • Hausa - hau (Chadic branch of Afro-Asiatic; Nigeria, Niger, and surrounding countries)

Niger-Congo Languages

  • Swahili - swa (East Africa)
  • Yoruba - yor (Nigeria and neighboring countries)
  • Igbo - ibo (Nigeria)
  • Akan - aka (Ghana)
  • Zulu - zul (South Africa)
  • Xhosa - xho (South Africa)
  • Lingala - lin (Democratic Republic of the Congo and Republic of the Congo)

Nilo-Saharan Languages

  • Nuer - nus (South Sudan)
  • Dinka - din (South Sudan)
  • Kanuri - kau (Nigeria, Niger)

Khoisan Languages (known for their click consonants)

  • Khoekhoe - naq (Namibia, South Africa)

Creole Languages

  • Nigerian Pidgin - pcm (widely used in informal contexts across Nigeria)

Note that this list does not claim to be comprehensive; feel free to add to it.