facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

Problems in Catalan files (encoding / conversions)

jordimas opened this issue

In the dataset zip file, the file dev/cat.dev contains sentences like:

Dilluns, cientfics de la Facultat de Medicina de la Universitat de Stanford van anunciar la invenci

which should be:

Dilluns, científics de la Facultat de Medicina de la Universitat de Stanford van anunciar la invenció

All the special characters (accented letters, apostrophes, etc.) are missing.

The file cat.devtest has the same problem.

The Spanish and Galician files (which use the same encoding, and both languages also derive from Latin) do not have this problem.
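A quick way to spot this kind of damage is to count how many lines in a file contain at least one non-ASCII character. This is only a minimal sketch, not the check the maintainers used: the helper names `non_ascii_line_ratio` and `check_file` are hypothetical, and the files are assumed to be UTF-8 text with one sentence per line. A Catalan file where the ratio is close to zero has almost certainly lost its accented characters.

```python
from pathlib import Path


def non_ascii_line_ratio(text: str) -> float:
    """Return the fraction of non-empty lines that contain a non-ASCII character."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if any(ord(ch) > 127 for ch in line))
    return hits / len(lines)


def check_file(path: str) -> float:
    """Read a file as UTF-8 (assumed encoding) and report its non-ASCII line ratio.

    Decoding is strict, so an invalid byte sequence raises UnicodeDecodeError
    instead of being silently replaced.
    """
    return non_ascii_line_ratio(Path(path).read_text(encoding="utf-8"))
```

For example, `check_file("dev/cat.dev")` could be compared against the result for the Spanish or Galician file in the same directory; a large gap between the two would point to the problem described above.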

Hey @jordimas
Thanks so much for looking into the dataset and raising this issue. We can't overstate how useful this feedback is!

We found the issue you mentioned in the file; it may have originated from a formatting error. We also double-checked that only the Catalan file has this issue. We are working on a fix and will update you as soon as we have one.

Thanks so much for the quick response and fix.

This dataset is pure gold for low-resource languages. Thanks so much for your work.

The dataset has now been updated with the Catalan file fixed. Thanks again for reporting this issue.

It works. Just checked. Thanks!