Problems in Catalan files (encoding / conversions)
jordimas opened this issue · comments
In the dataset zip file, the file dev/cat.dev contains sentences like:
Dilluns, cientfics de la Facultat de Medicina de la Universitat de Stanford van anunciar la invenci
which should be:
Dilluns, científics de la Facultat de Medicina de la Universitat de Stanford van anunciar la invenció
All the special characters (accents, apostrophes, etc.) are missing.
The file cat.devtest has the same problem.
The Spanish and Galician language files (which use the same encoding and, like Catalan, derive from Latin) do not have this problem.
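For anyone who wants to check other language files for the same damage, here is a minimal sketch. It assumes the files are UTF-8 text with one sentence per line, and it only measures how many lines contain at least one non-ASCII character. The function names are hypothetical and not part of the dataset tooling. A ratio near zero for a language that normally uses accents (Catalan, Spanish, Galician) suggests the characters were stripped; it will not catch other kinds of corruption such as mojibake.

```python
def non_ascii_line_ratio(lines):
    """Return the fraction of non-empty lines containing a non-ASCII character."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return 0.0
    with_non_ascii = sum(1 for line in non_empty if not line.isascii())
    return with_non_ascii / len(non_empty)


def check_file(path, encoding="utf-8"):
    """Compute the ratio for a file on disk (the UTF-8 default is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return non_ascii_line_ratio(handle)


# The two sentences quoted in this issue, used as a tiny sanity check.
broken = ("Dilluns, cientfics de la Facultat de Medicina de la "
          "Universitat de Stanford van anunciar la invenci")
expected = ("Dilluns, científics de la Facultat de Medicina de la "
            "Universitat de Stanford van anunciar la invenció")

print(non_ascii_line_ratio([broken]))    # 0.0
print(non_ascii_line_ratio([expected]))  # 1.0
```

Running `check_file("dev/cat.dev")` and comparing the result with the ratio for the Spanish or Galician file should make the difference obvious.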
Hey @jordimas
Thanks so much for looking into the dataset and raising this issue. We can't overstate how useful this kind of feedback is!
We found the issue you mentioned in the file; it probably originated from a formatting error. We also double-checked that only the Catalan file has this issue. We are working on a fix and will update you as soon as we have one.
Thanks so much for the quick response and fix.
This dataset is pure gold for low-resource languages. Thanks so much for your work.
The dataset has now been updated with the Catalan file fixed. Thanks again for reporting this issue.
It works. Just checked. Thanks!