hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

shuffle data

kpu opened this issue

I think the data isn't shuffled, or at least ECB has consecutive sentences. Shouldn't I be looking at a random representative sample of the data?

This is especially bad for, e.g., TildeMODEL-v2018.en-mt, which is sorted.

commented

The sample is intentionally N lines of head, n lines sampled from the middle (randomly selected, but in the order they appeared in) and N lines from the tail.
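
To make that scheme concrete, here is a minimal sketch of head / middle / tail sampling. It is not OpusCleaner's actual code: the function name `sample_head_middle_tail`, the `seed` parameter, and the choice to use the same count `n` for all three parts are assumptions made for illustration, and it reads the whole dataset into memory for simplicity.

```python
import random
from typing import List, Optional, Sequence


def sample_head_middle_tail(
    lines: Sequence[str], n: int, seed: Optional[int] = None
) -> List[str]:
    """Hypothetical helper: n head lines, n random middle lines, n tail lines.

    The middle lines are picked at random but returned in the order
    they appear in the input.
    """
    if n <= 0:
        return []
    if len(lines) <= 2 * n:
        # Too short to have a separate middle section: return everything.
        return list(lines)

    rng = random.Random(seed)
    head = list(lines[:n])
    tail = list(lines[-n:])

    # Indices strictly between the head and the tail.
    middle_indices = range(n, len(lines) - n)
    k = min(n, len(middle_indices))
    # Sort the chosen indices so the sampled lines keep their original order.
    picked = sorted(rng.sample(middle_indices, k))

    return head + [lines[i] for i in picked] + tail


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(1000)]
    for line in sample_head_middle_tail(corpus, n=3, seed=42):
        print(line)
```

With a sorted corpus such as the one mentioned above, a sample built this way still shows the start and end of the file verbatim, while the middle part is spread across the rest of the data rather than being consecutive.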