shuffle data
kpu opened this issue
I think the data isn't shuffled, or at least ECB has consecutive sentences. Shouldn't I be looking at a random representative sample of the data?
This is really bad for e.g. TildeMODEL-v2018.en-mt, which is sorted.
The sample is intentionally N lines of head, n lines sampled from the middle (randomly selected, but in the order they appeared in) and N lines from the tail.
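For anyone following along, here is a rough sketch of what that head / middle / tail scheme might look like. This is not the project's actual code: the function name `sample_head_middle_tail` and its parameters are hypothetical, and it assumes the whole file fits in memory as a list of lines.

```python
import random


def sample_head_middle_tail(lines, n_head, n_middle, n_tail, seed=None):
    """Hypothetical sketch: first n_head lines, n_middle random lines from
    the middle (kept in original order), and the last n_tail lines."""
    rng = random.Random(seed)
    total = len(lines)

    # If the file is too short to have a middle, return everything.
    if total <= n_head + n_tail:
        return list(lines)

    head = list(lines[:n_head])
    tail = list(lines[total - n_tail:])

    # Candidate positions strictly between the head and the tail.
    middle_indices = range(n_head, total - n_tail)
    k = min(n_middle, len(middle_indices))

    # Pick k positions at random, then sort so the sampled lines
    # appear in the order they had in the file.
    picked = sorted(rng.sample(middle_indices, k))
    middle = [lines[i] for i in picked]

    return head + middle + tail


if __name__ == "__main__":
    demo = [f"line {i}" for i in range(100)]
    for line in sample_head_middle_tail(demo, 3, 5, 3, seed=0):
        print(line)
```

Note that with a sorted corpus the head and tail are still the extremes of the sort order; only the middle portion is a random draw.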