facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

The Cantonese (Yue Chinese, `yue_Hant`) data in FLORES-200 is not Cantonese at all

ayaka14732 opened this issue

The Cantonese (Yue Chinese, `yue_Hant`) data in FLORES-200 is completely wrong. The data is not Cantonese at all, but rather Mandarin Chinese in Traditional Chinese script (`zho_Hant`), which differs only stylistically from the `zho_Hant` data already in the dataset.

Furthermore, the paper mentions that the `yue_Hant` and `zho_Hant` data tend to be predicted as each other. This is because both datasets actually consist exclusively of `zho_Hant` data. Genuine `yue_Hant` and `zho_Hant` text should be very easy to distinguish from each other.

Here is what correct `yue_Hant` data would look like:

| Language Code | Sentence |
| --- | --- |
| eng_Latn | They found the Sun operated on the same basic principles as other stars: The activity of all stars in the system was found to be driven by their luminosity, their rotation, and nothing else. |
| zho_Hant | 他們發現太陽的運作與其他恆星的基本原理相同:系統中所有恆星的活動均受其光度、自轉所推動,就是這麼簡單。 |
| yue_Hant (wrong) | 他們發現,太陽和其他恆星的運行原理是一樣的:系統中所有恆星的活動都是由它們的亮度、自轉驅動的,而並非其他因素。 |
| yue_Hant (corrected) | 佢哋發現,太陽其他恆星運行原理分別:系統入面所有恆星活動都淨係佢哋嘅亮度自轉推動,而包括其他因素。 |

(Bold denotes words that are used exclusively in yue_Hant)

Others have been complaining about this for a long time: https://twitter.com/chaakming/status/1555246138105614336

I guess nobody on the FLORES team knows Cantonese and Mandarin well enough to understand the unique situation of this language. The data currently collected for yue is Hong Kong Chinese, NOT Cantonese. We recommend using this classifier to filter for real Cantonese data: https://github.com/CanCLID/cantonese-classifier
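
To illustrate how easy the two are to tell apart, here is a minimal sketch of a character-based heuristic. It is not the cantonese-classifier linked above and does not use its API; the marker lists and the function names (`count_markers`, `guess_variety`) are assumptions made up for this example, and the lists are far from exhaustive. It only counts characters that are typical of written Cantonese against function words typical of written Mandarin.

```python
# Hypothetical marker lists, picked by hand for illustration only.
# They are NOT taken from the cantonese-classifier project.
CANTONESE_MARKERS = ("佢", "哋", "嘅", "咗", "唔", "冇", "喺", "嚟", "啲", "嗰", "咁", "嘢")
MANDARIN_MARKERS = ("他們", "她們", "的", "是", "這", "那", "沒有", "在")


def count_markers(text, markers):
    """Count how many times any of the given markers occurs in the text."""
    return sum(text.count(marker) for marker in markers)


def guess_variety(text):
    """Return 'yue_Hant', 'zho_Hant', or 'unknown' based on marker counts.

    This is a rough heuristic: whichever marker set occurs more often wins,
    and a tie (including no markers at all) yields 'unknown'.
    """
    yue_score = count_markers(text, CANTONESE_MARKERS)
    cmn_score = count_markers(text, MANDARIN_MARKERS)
    if yue_score > cmn_score:
        return "yue_Hant"
    if cmn_score > yue_score:
        return "zho_Hant"
    return "unknown"


if __name__ == "__main__":
    # Sentence currently labelled yue_Hant in FLORES-200 (from the table above).
    flores_yue = "他們發現,太陽和其他恆星的運行原理是一樣的:系統中所有恆星的活動都是由它們的亮度、自轉驅動的,而並非其他因素。"
    # A short written-Cantonese fragment using words from the corrected row.
    real_yue = "佢哋發現,系統入面所有恆星嘅活動"
    print(guess_variety(flores_yue))
    print(guess_variety(real_yue))
```

Even this crude check labels the current FLORES-200 `yue_Hant` sentence as `zho_Hant`. For actual filtering, a proper classifier such as the one linked above is the better choice, since a hand-picked marker list will misjudge short or formal text.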