openlanguagedata / flores

The FLORES+ Machine Translation Benchmark

Missing information for Sardinian (`srd_latn`)

jeanm opened this issue

commented

Sardinian consists of different varieties and has multiple orthographies. Both FLORES+ and Seed are missing information on which ones exactly are used in the data.

When the variety isn't specified, you should use Standardized Sardinian, known as Limba Sarda Comuna (LSC), which was created for exactly this purpose. It is particularly suitable for the goals of FLORES+: cross-translation, standardization of a written form, and preservation of an endangered language. LSC offers many advantages here: it is the official variety that the regional government, the "Regione Autonoma della Sardegna", uses for laws and official communications. Additionally, many Wikipedia pages and books are already written in LSC, and numerous grammar guidelines are available for it. Cultural associations that use and teach Sardinian also write in LSC. Lexically, LSC already encompasses all varieties: words from Logudoresu, Campidanesu, Nuoresu, etc. that refer to the same thing are treated as synonyms. We can begin with LSC and, at the same time, request that at least the two main varieties (Logudoresu and Campidanesu) be added to the future FLORES+ list, since translating into them from LSC would be easier than translating from Italian.

In my opinion, if we start debating and rejecting Standard Sardinian, we risk repeating what happened with Duolingo: there was widespread disagreement, and after a decade the request to add the language to the course list was declined. The language is already too endangered to risk not being properly included in FLORES+ and, as a result, missing the opportunity to be part of projects like No Language Left Behind (NLLB).

commented

Hi @srfro! Thank you, this is very valuable feedback. We are organising a shared task at WMT24 with the purpose of improving/extending this data. Would you be interested in participating?

commented

Hi @srfro, are you still interested in adding Limba Sarda Comuna, Logudoresu and Campidanesu? As I mentioned in my previous message, we are running a shared task at WMT24 with the specific purpose of extending the datasets and improving existing data. We are asking people to indicate interest by 20th May. If this is still of interest to you, feel free to get in touch at info@oldi.org; we would be happy to assist.