embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page:https://arxiv.org/abs/2210.07316

Display languages information in the scoreboard

david-koleckar opened this issue

For multilingual models, it would be very beneficial to display the percentage of training data in each language, e.g.
{english: 0.82, chinese: 0.13, french: 0.05}

It is currently very hard to validate the training-data language percentages: one has to go through the datasets, etc., and sometimes the information is not even present.

As a result, the scoreboard is very hard to navigate when you need to choose a multilingual model, because multilingual quality is poorly defined.
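To make the proposed format concrete, here is a minimal sketch of how such a language breakdown could be derived from per-language document counts. The function name `language_shares` and the counts in the usage note are hypothetical and chosen only to reproduce the example above; nothing here is part of MTEB or the Hugging Face Hub.

```python
def language_shares(doc_counts: dict[str, int]) -> dict[str, float]:
    """Turn per-language document counts into fractions of the total.

    `doc_counts` is a hypothetical input: a mapping from language name
    to the number of training documents in that language.
    """
    total = sum(doc_counts.values())
    if total == 0:
        # No data at all: there is nothing to report.
        return {}
    # Sort by count, largest share first, and round to two decimals.
    ordered = sorted(doc_counts.items(), key=lambda kv: kv[1], reverse=True)
    return {lang: round(count / total, 2) for lang, count in ordered}
```

With the made-up counts `{"english": 8200, "chinese": 1300, "french": 500}`, this returns the breakdown shown in the example.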

Hi @david-koleckar, we agree that the multilingual benchmarks are poorly defined. We are currently working on extending MTEB to the multilingual case (MMTEB).

I am unsure if you want the documentation for the benchmark tab or for the model itself?

That's great news!
Well, both would be great. Indicate somehow in the benchmark how multilingual the model is (which languages, etc.), and give more detailed information about the datasets it was trained on on the model page itself. It would be awesome if HF made this mandatory. Right now it is sometimes a struggle to find out which datasets were used and how, even though this is such a paramount part of a model card.

> Well both would be great. Indicate somehow in the benchmark how much (what languages, etc.) is model multilingual.

I don't think we will do this. Declaring language coverage is up to the model creators rather than the benchmark, so we will leave that to the model's datasheet. I agree, though, that it would be great to have.

And which languages would be present in MMTEB?

At the moment, I believe it covers more than 100 different languages. You can run:

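import mteb

# Fetch every task currently registered in MTEB
tasks = mteb.get_tasks()

# Print the languages covered by each task
for task in tasks:
    print(task.languages)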

to get an overview of all current tasks, but more will likely be added, and some potentially removed, before we arrive at a final version of the benchmark.
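If a summary is more useful than one line per task, the per-task output can be aggregated. The following is a rough sketch, assuming each task object exposes an iterable `languages` attribute as in the loop above; the helper name `count_task_languages` is hypothetical and not part of mteb.

```python
from collections import Counter
from typing import Iterable


def count_task_languages(tasks: Iterable) -> Counter:
    """Count how many tasks cover each language.

    Assumption: every task has an iterable `languages` attribute.
    """
    counts: Counter = Counter()
    for task in tasks:
        # De-duplicate within a task so each task counts once per language.
        counts.update(set(task.languages))
    return counts


# Usage with mteb installed (not executed here):
#   import mteb
#   counts = count_task_languages(mteb.get_tasks())
#   print(len(counts), "languages in total")
#   print(counts.most_common(10))
```

`len(counts)` then gives the number of distinct languages, which is one way to check the "more than 100" estimate against whatever version of the benchmark is installed.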