embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

Add a Benchmark for Asian Languages

KennethEnevoldsen opened this issue

Linguistic Families and Proposed Languages:

East Asian Languages

  • Chinese (Mandarin) - cmn
  • Cantonese - yue (#370)
  • Japanese - jpn
  • Korean - kor
  • Mongolian - mon

South Asian Languages

  • Indic Languages:
    • Hindi - hin
    • Bengali - ben
    • Punjabi - pan
    • Marathi - mar
    • Gujarati - guj
    • Urdu - urd
    • Nepali - nep
    • Sinhala - sin
    • Tamil - tam
    • Telugu - tel
    • Kannada - kan
    • Malayalam - mal
  • Dravidian Languages:
    • Included above (Tamil, Telugu, Kannada, Malayalam)

Southeast Asian Languages

  • Austronesian Languages:
    • Indonesian - ind
    • Filipino - fil (#472)
    • Malay - msa
    • Javanese - jav
  • Tai-Kadai Languages:
    • Thai - tha
    • Lao - lao
  • Austroasiatic Languages:
    • Vietnamese - vie (see #364)
    • Khmer - khm
  • Burmese - mya

Central Asian Languages

  • Turkic Languages:
    • Kazakh - kaz
    • Uzbek - uzb
    • Turkmen - tuk
    • Kyrgyz - kir
    • Uighur - uig

West Asian (Middle Eastern) Languages

  • Semitic Languages:
    • Arabic - ara
    • Hebrew - heb
  • Iranian Languages:
    • Persian - fas
    • Kurdish - kur
    • Pashto - pus
    • Dari - prs

Note that this list does not claim to be comprehensive; feel free to add to it.
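
For anyone who wants to track which of the proposed languages still need a volunteer, here is a minimal sketch in plain Python. It covers only some of the groups above. The `PROPOSED` mapping, the `missing_languages` helper, and the `claimed` set are illustrative names and not part of mteb; `claimed` simply holds the three codes that already have a linked issue in the list (#370, #472, #364).

```python
# Subset of the proposed languages above, keyed by region (ISO 639-3 codes).
PROPOSED = {
    "East Asian": ["cmn", "yue", "jpn", "kor", "mon"],
    "Southeast Asian": ["ind", "fil", "msa", "jav", "tha", "lao", "vie", "khm", "mya"],
    "West Asian": ["ara", "heb", "fas", "kur", "pus", "prs"],
}


def missing_languages(proposed, claimed):
    """Return, per region, the proposed codes that are not in `claimed`."""
    return {
        region: [code for code in codes if code not in claimed]
        for region, codes in proposed.items()
    }


# Assumption: a language counts as claimed if the list above links an issue for it.
claimed = {"yue", "fil", "vie"}

gaps = missing_languages(PROPOSED, claimed)
for region, codes in gaps.items():
    print(f"{region}: {', '.join(codes)}")
```

Keeping the list as a plain mapping makes it easy to extend with the remaining groups or to swap `claimed` for whatever source of truth the project ends up using.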

I will take a stab at a Bengali benchmark together with a colleague of mine 👍

Wonderful, @rasdani! Feel free to create an issue for this as well so that others can see that you are working on it.