Dataset Problem.

Question


ShadowVicky opened this issue 2 years ago

In the paper, you wrote that for the Assamese language you have 738k mono text and 43.7k bitext. But we are getting only 1912 Assamese-English pairs. Could you please provide us the whole dataset, i.e. the 738k mono text and the 43.7k bitext? It would be really helpful for us. Thanking you in advance.

Guillaume Wenzek · Answer 1 · Mon Oct 17 2022 16:10:58 GMT+0800 (China Standard Time)

Hi, which paper are you referring to? Where are you downloading the data from?
I think there is some confusion between train and test data.
With the NLLB200 paper we shared some training data extracted from web corpora. You can download it from here: https://huggingface.co/datasets/allenai/nllb
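For anyone who wants to try that download from Python, here is a minimal sketch using the Hugging Face `datasets` library. The helper names are hypothetical, and the config name `asm_Beng-eng_Latn` is an assumption: it supposes the dataset names each pair by joining the two FLORES-200 language codes in alphabetical order, so please check the dataset card for the exact name. Depending on your `datasets` version, script-based datasets like this one may also need extra options or may not load at all.

```python
def nllb_pair_config(lang_a: str, lang_b: str) -> str:
    """Build a pair name such as 'asm_Beng-eng_Latn'.

    Assumption: the dataset names each pair by joining the two
    language codes in alphabetical order with a hyphen.
    """
    first, second = sorted([lang_a, lang_b])
    return f"{first}-{second}"


def load_nllb_pair(lang_a: str, lang_b: str):
    """Stream the training bitext for one language pair (hypothetical helper)."""
    # Imported lazily so nllb_pair_config works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "allenai/nllb",
        nllb_pair_config(lang_a, lang_b),
        split="train",
        streaming=True,  # avoid downloading the whole corpus up front
    )


if __name__ == "__main__":
    # Assamese (asm_Beng) paired with English (eng_Latn).
    print(nllb_pair_config("eng_Latn", "asm_Beng"))
```

Calling `load_nllb_pair("asm_Beng", "eng_Latn")` should then give an iterable over the training pairs, if the assumed config name is right.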

The 1912 bitext is probably the dev + devtest portion of the Flores200 dataset. Those translations are meant for evaluation, not training.

Shadow Vicky · Answer 2 · Mon Oct 17 2022 19:15:19 GMT+0800 (China Standard Time)

As mentioned in the abstract of "The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation", Flores-101 provides 3,001 English sentences translated into other languages (including Assamese). On downloading it from
1. https://github.com/facebookresearch/flores/tree/main/flores200
2. https://huggingface.co/datasets/gsarti/flores_101
we get two sets, dev and devtest, with 997 and 1012 sentences respectively for various languages. The paper also mentions a 43K bitext (Assamese, Bitext w/ En) and 738K mono text.

Question: How can we get the 43K bitext, 738K monotext and the 3,001 benchmark set?

Guillaume Wenzek · Answer 3 · Mon Oct 17 2022 21:08:35 GMT+0800 (China Standard Time)

> we get two sets: dev and devtest, each with 997 and 1012 sentences for various languages.

That's expected. The test set is secret and you won't be able to download it. That's why you only have 2000 sentences and not 3000.
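To make the arithmetic explicit, here is a small sketch using only the numbers quoted in this thread (997 dev, 1012 devtest, 3,001 in total). The variable and function names are just illustrative.

```python
# Split sizes quoted in the thread; the total comes from the FLORES-101 abstract.
FLORES101_TOTAL = 3001
PUBLIC_SPLITS = {"dev": 997, "devtest": 1012}


def hidden_test_size(total: int, public_splits: dict) -> int:
    """Sentences left for the undisclosed test set."""
    return total - sum(public_splits.values())


public_total = sum(PUBLIC_SPLITS.values())
print(public_total)                                      # 2009
print(hidden_test_size(FLORES101_TOTAL, PUBLIC_SPLITS))  # 992
```

So the roughly 2000 downloadable sentences are dev plus devtest, and the remaining sentences out of the 3,001 are the held-out test set.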

The bitext mentioned in this paper can be found on statmt.org: https://data.statmt.org/cc-matrix/