How do I say this dataset is so garbage it shouldn't be used?
kpu opened this issue · comments
I downloaded a dataset, CCAligned-v1.en-mt. It has 37 sentences and maybe 2 are correct. How do I mark it as "do not use"?
Should we create something like blocklists that we simply add to the repository and then the UI shows a warning if a corpus is in that blocklist?
OpusCleaner is in essence a structured review of a dataset. We should be posting our reviews.
There are a few reasons I might say "do not use". One is that a corpus is an older version of the same corpus under a different name, e.g. MultiParaCrawl v5 appears for en-mt.
In the short term I just want to be able to record that "do not use" category in OpusCleaner.
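To make the blocklist idea above concrete, here is a minimal sketch of what a repository-level blocklist with a "do not use" reason could look like. Everything in it is an assumption for illustration: the JSON layout, the `Reason` categories, the names `load_blocklist` and `warning_for`, and the spelling of the MultiParaCrawl dataset id are all hypothetical and not part of OpusCleaner's actual code or data format.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Reason(Enum):
    """Hypothetical categories for why a dataset is marked 'do not use'."""
    LOW_QUALITY = "low_quality"  # most sentence pairs are wrong
    SUPERSEDED = "superseded"    # older version of a corpus under another name


@dataclass(frozen=True)
class BlocklistEntry:
    dataset: str   # dataset id as shown to the user
    reason: Reason
    note: str = ""  # free-text review comment


# Assumed file format: a JSON list that lives in the repository.
# The second dataset id is spelled by assumption; the thread only
# says "MultiParaCrawl v5 appears for en-mt".
EXAMPLE_BLOCKLIST = """
[
  {"dataset": "CCAligned-v1.en-mt", "reason": "low_quality",
   "note": "37 sentences, maybe 2 are correct"},
  {"dataset": "MultiParaCrawl-v5.en-mt", "reason": "superseded",
   "note": "older version of the same corpus under a different name"}
]
"""


def load_blocklist(text: str) -> Dict[str, BlocklistEntry]:
    """Parse the JSON blocklist into a dict keyed by dataset id."""
    entries = {}
    for raw in json.loads(text):
        entry = BlocklistEntry(
            dataset=raw["dataset"],
            reason=Reason(raw["reason"]),  # raises ValueError on unknown reasons
            note=raw.get("note", ""),
        )
        entries[entry.dataset] = entry
    return entries


def warning_for(blocklist: Dict[str, BlocklistEntry], dataset: str) -> Optional[str]:
    """Return the warning text a UI could show, or None if the dataset is not listed."""
    entry = blocklist.get(dataset)
    if entry is None:
        return None
    return f"Do not use {dataset} ({entry.reason.value}). {entry.note}".strip()


if __name__ == "__main__":
    blocklist = load_blocklist(EXAMPLE_BLOCKLIST)
    for name in ("CCAligned-v1.en-mt", "some-other-corpus.en-mt"):
        print(name, "->", warning_for(blocklist, name))
```

Keeping the reason as a small fixed set of categories plus a free-text note would let the UI show a short warning while still preserving the reviewer's comment. Whether the list belongs in the repository or alongside published reviews is left open here.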