Display languages information in the scoreboard

Question

david-koleckar opened this issue a month ago

It would be very beneficial, for multilingual models, to display the percentage of training data in each language, e.g.
{english: 0.82, chinese: 0.13, french: 0.05}

It is currently very hard to verify the language percentages of the training data: one has to go through the datasets, etc., and sometimes the information is not even available.

As a result, it is very hard to navigate the scoreboard when you need to choose a multilingual model, because multilingual quality is poorly defined.
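
For illustration, a breakdown in the format requested above could be derived from per-language example counts. This is only a minimal sketch: the function name `language_shares` and the counts are hypothetical, chosen to reproduce the example figures, and they do not describe any real model or dataset.

```python
def language_shares(counts):
    """Return each language's share of the total, rounded to two decimals."""
    total = sum(counts.values())
    if total == 0:
        # No data at all: there is nothing to report.
        return {}
    return {lang: round(n / total, 2) for lang, n in counts.items()}


# Hypothetical example counts, chosen only to illustrate the format.
example_counts = {"english": 8200, "chinese": 1300, "french": 500}
shares = language_shares(example_counts)
print(shares)
```

Whether the shares are counted per example, per token, or per document is a separate choice that a model card would need to state.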

Kenneth Enevoldsen · Answer 1 · Mon Apr 29 2024 20:18:49 GMT+0800 (China Standard Time)

Hi @david-koleckar, we agree that the multilingual benchmarks are poorly defined. We are currently working on extending MTEB to the multilingual case (MMTEB).

I am unsure whether you want this documentation for the benchmark tab or for the model itself.

David Kolečkář · Answer 2 · Mon Apr 29 2024 21:01:20 GMT+0800 (China Standard Time)

That's great news!
Well, both would be great: indicate somehow in the benchmark how multilingual the model is (which languages, etc.), and give more detailed information about its training datasets on the model page itself. It would be awesome if HF made this mandatory. Right now it is sometimes a struggle to find out which datasets were used and how, even though this is such a paramount part of a model card.

Kenneth Enevoldsen · Answer 3 · Mon Apr 29 2024 21:45:12 GMT+0800 (China Standard Time)

> Well both would be great. Indicate somehow in the benchmark how much (what languages, etc.) is model multilingual.

I don't think we will do this. It is up to the model creators, rather than the benchmark, to state this, so we will leave it to the model's datasheet. I agree that it would be great to have, though.

David Kolečkář · Answer 4 · Wed May 01 2024 00:27:03 GMT+0800 (China Standard Time)

And which languages will be present in MMTEB?

Kenneth Enevoldsen · Answer 5 · Wed May 01 2024 03:59:32 GMT+0800 (China Standard Time)

At the moment, I believe more than 100 different languages. You can run:

import mteb

tasks = mteb.get_tasks()

for task in tasks:
    # print the languages covered by each task
    print(task.languages)

to get an overview of all current tasks, but more will likely be added, and some potentially removed, before we arrive at a final version of the benchmark.
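
To turn that per-task listing into a single overview, the languages could be tallied across tasks. The following is a rough sketch rather than part of MTEB: `count_tasks_per_language` is a hypothetical helper, the example tasks are stand-in objects with made-up names and language codes, and it assumes each task exposes its languages as a list on a `languages` attribute.

```python
from collections import Counter
from types import SimpleNamespace


def count_tasks_per_language(tasks):
    """Count how many tasks cover each language."""
    counter = Counter()
    for task in tasks:
        # set() so a language listed twice in one task is counted once
        counter.update(set(task.languages))
    return counter


# Hypothetical stand-ins for task objects; real tasks would come from mteb.
example_tasks = [
    SimpleNamespace(name="TaskA", languages=["eng", "fra"]),
    SimpleNamespace(name="TaskB", languages=["eng"]),
    SimpleNamespace(name="TaskC", languages=["zho", "eng"]),
]

coverage = count_tasks_per_language(example_tasks)
for lang, n_tasks in coverage.most_common():
    print(lang, n_tasks)
```

With mteb installed, passing the result of `mteb.get_tasks()` in place of `example_tasks` should give the same kind of summary, provided the attribute assumption holds for the installed version.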