shuffle data

Question

kpu opened this issue a year ago

I think the data isn't shuffled, or at least ECB has consecutive sentences. Shouldn't I be looking at a random representative sample of the data?

Kenneth Heafield · Answer 1 · Sun Aug 20 2023 05:50:37 GMT+0800 (China Standard Time)

This is especially bad for datasets that are sorted, e.g. TildeMODEL-v2018.en-mt.

Jelmer · Answer 2 · Mon Aug 21 2023 19:07:42 GMT+0800 (China Standard Time)

The sample is intentionally N lines from the head, n lines sampled from the middle (randomly selected, but kept in the order they appeared in), and N lines from the tail.
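
For readers who want to see the idea concretely, here is a minimal sketch of that head/middle/tail sampling. It is not the project's actual code: the function name `sample_head_middle_tail` and the parameters `n_edge` and `n_middle` are hypothetical, and it assumes the whole file fits in memory as a list of lines.

```python
import random
from typing import List, Optional


def sample_head_middle_tail(
    lines: List[str],
    n_edge: int,
    n_middle: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: head lines, a random middle sample, tail lines.

    The middle lines are chosen at random but returned in the order
    they appear in the input, as described in the answer above.
    """
    total = len(lines)
    # If the input is too short to split into three parts, return it all.
    if total <= 2 * n_edge + n_middle:
        return list(lines)

    rng = random.Random(seed)
    head = lines[:n_edge]
    # Index from the front so that n_edge == 0 yields an empty tail.
    tail = lines[total - n_edge:]

    # Pick distinct indices from the middle, then sort them so the
    # sampled lines keep their original relative order.
    picked = sorted(rng.sample(range(n_edge, total - n_edge), n_middle))
    middle = [lines[i] for i in picked]

    return head + middle + tail
```

Sorting the picked indices is what keeps the middle sample in file order. Note that a sample like this still shows consecutive sentences in the head and tail, and only the middle part is a random draw.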