facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

Extra zero-width characters in the dataset

bact opened this issue

There are some instances of zero-width characters in the dataset, for example:

  • Zero-width space (ZWSP / U+200B) at line 236 of dev/tha.dev; lines 191 and 680 of devtest/tha.dev
  • Repeated zero-width non-joiners (ZWNJ / U+200C) at line 121 of dev/pus.dev (a single ZWNJ can be valid, but here it appears 7 times in a row)
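
For anyone who wants to look for similar cases, here is a minimal Python sketch of a scan. The function name `find_zero_width` and the sample lines are hypothetical, not part of the FLoRes tooling, and the character set only covers the two code points mentioned above.

```python
import re

# Zero-width characters named in this issue; extend the mapping as needed.
ZERO_WIDTH = {"\u200b": "ZWSP", "\u200c": "ZWNJ"}
_ZW_RUN = re.compile("[\u200b\u200c]+")


def find_zero_width(lines):
    """Yield (line_number, column, names) for each run of zero-width characters.

    Line numbers and columns are 1-based; names lists every character in the run.
    """
    for line_no, line in enumerate(lines, start=1):
        for match in _ZW_RUN.finditer(line):
            names = [ZERO_WIDTH[ch] for ch in match.group()]
            yield line_no, match.start() + 1, names


# Hypothetical in-memory sample standing in for a dataset file.
sample = ["clean line\n", "a\u200bb\n", "x\u200c\u200c\u200cy\n"]
hits = list(find_zero_width(sample))
for line_no, column, names in hits:
    print(line_no, column, names)
```

Any iterable of strings works, so a file opened with `encoding="utf-8"` can be passed in place of `sample`.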

While ZWSP is unlikely to affect semantics (it is mostly a typesetting aid, although some word processors insert it as a word delimiter for languages that do not put spaces between words), ZWNJ can affect the meaning of words.

Users can clean this up themselves, and given how few instances there are, the effect on evaluation is negligible. So there is probably no need to change the dataset (or it is a low priority), but it is good to know for anyone who processes it. Just a note.
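
One possible user-side cleanup, as a rough sketch: drop every ZWSP and collapse runs of ZWNJ down to a single one, so a potentially valid ZWNJ is kept. Both choices are assumptions rather than anything the dataset maintainers recommend, and the name `clean_zero_width` is hypothetical. Whether removing ZWSP is right depends on how your pipeline segments Thai.

```python
import re

ZWSP = "\u200b"  # zero-width space
ZWNJ = "\u200c"  # zero-width non-joiner
_ZWNJ_RUN = re.compile(ZWNJ + "{2,}")  # two or more ZWNJ in a row


def clean_zero_width(text):
    """Remove every ZWSP and collapse repeated ZWNJ into a single ZWNJ."""
    text = text.replace(ZWSP, "")
    return _ZWNJ_RUN.sub(ZWNJ, text)


# Hypothetical example: seven ZWNJ in a row become one.
noisy = "x" + ZWNJ * 7 + "y"
print(len(noisy), len(clean_zero_width(noisy)))
```

A lone ZWNJ passes through unchanged, so words that legitimately rely on it are not altered.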