hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

How do I say this dataset is so garbage it shouldn't be used?

kpu opened this issue

I downloaded a dataset, CCAligned-v1.en-mt. It has 37 sentences and maybe 2 are correct. How do I mark it as "do not use"?

Should we create something like blocklists that we simply add to the repository and then the UI shows a warning if a corpus is in that blocklist?
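As a rough illustration of the blocklist idea, here is a minimal sketch in Python. Nothing in it is existing OpusCleaner code: the JSON layout, the `reason` categories, the corpus keys, and the names `BlockEntry`, `load_blocklist` and `warning_for` are all assumptions made up for this example. It only shows how a checked-in blocklist could be turned into a warning string for the UI.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical blocklist file contents (assumed format, not an OpusCleaner one).
# Keys are illustrative corpus identifiers; "reason" is a free-form category.
BLOCKLIST_JSON = """
{
  "CCAligned-v1.en-mt": {
    "reason": "low_quality",
    "note": "37 sentences, maybe 2 are correct"
  },
  "MultiParaCrawl-v5.en-mt": {
    "reason": "superseded",
    "note": "older version of the same corpus under a different name"
  }
}
"""


@dataclass
class BlockEntry:
    reason: str  # short category, e.g. "low_quality" or "superseded"
    note: str    # human-readable explanation from the reviewer


def load_blocklist(text: str) -> Dict[str, BlockEntry]:
    """Parse the JSON blocklist into a mapping from corpus name to entry."""
    raw = json.loads(text)
    return {name: BlockEntry(**fields) for name, fields in raw.items()}


def warning_for(corpus: str, blocklist: Dict[str, BlockEntry]) -> Optional[str]:
    """Return a warning string if the corpus is blocklisted, else None."""
    entry = blocklist.get(corpus)
    if entry is None:
        return None
    return f'{corpus} is marked "do not use" ({entry.reason}): {entry.note}'


if __name__ == "__main__":
    blocklist = load_blocklist(BLOCKLIST_JSON)
    for name in ["CCAligned-v1.en-mt", "SomeOtherCorpus.en-mt"]:
        message = warning_for(name, blocklist)
        print(message if message else f"{name}: no warning")
```

Keeping a `reason` field next to the free-text note would let the UI group or filter warnings by category, but that is a design choice for this sketch rather than anything decided in this thread.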

OpusCleaner is in essence a structured review of a dataset. We should be posting our reviews.

There are a few reasons I might say "do not use". One is that the corpus is an older version of another corpus under a different name, e.g. MultiParaCrawl v5 appears for en-mt.

In the short term I just want to be able to note that category in OpusCleaner.