hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

Configure the diff view to select diffs between different steps

jindrahelcl opened this issue

When the dataset contains consistent but easy-to-clean noise (e.g. a space at the end of every line), running a filter that removes the space makes the whole diff trivial: every line is shown as deleted and then re-inserted. In practice, this means I have no way of checking what any of the subsequent filters are doing.

I think the changes UI should let you choose which two points in the cleaning pipeline you are actually diffing.

I can apply the filter manually, but that's not a sustainable solution, especially for larger datasets.
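
To illustrate the request, here is a minimal sketch of diffing between two arbitrary points in a pipeline. It is not OpusCleaner's actual API: `run_pipeline`, `diff_between`, and the two toy filters are hypothetical names made up for this example, and filters are assumed to be plain functions over lists of lines. The idea is to keep a snapshot after every step and let the user pick which two snapshots to compare.

```python
import difflib
from typing import Callable, List, Sequence

# Hypothetical filter type for this sketch: takes lines, returns lines.
Filter = Callable[[List[str]], List[str]]


def run_pipeline(lines: Sequence[str], filters: Sequence[Filter]) -> List[List[str]]:
    """Return one snapshot per pipeline point.

    snapshots[0] is the raw input; snapshots[i] is the output of the i-th filter.
    """
    snapshots = [list(lines)]
    for apply_filter in filters:
        snapshots.append(apply_filter(snapshots[-1]))
    return snapshots


def diff_between(snapshots: Sequence[List[str]], start: int, end: int) -> List[str]:
    """Unified diff (no context lines) between two chosen pipeline points."""
    return list(
        difflib.unified_diff(
            snapshots[start],
            snapshots[end],
            fromfile=f"step {start}",
            tofile=f"step {end}",
            lineterm="",
            n=0,
        )
    )


# Toy filters, invented for the example.
def strip_trailing_space(lines: List[str]) -> List[str]:
    return [line.rstrip(" ") for line in lines]


def drop_single_word_lines(lines: List[str]) -> List[str]:
    return [line for line in lines if len(line.split()) >= 2]


raw = ["hello world ", "ok ", "good morning "]
snapshots = run_pipeline(raw, [strip_trailing_space, drop_single_word_lines])

# Diffing raw input against the final output touches every line.
full_diff = diff_between(snapshots, 0, 2)

# Diffing step 1 against step 2 isolates what the second filter did.
step_diff = diff_between(snapshots, 1, 2)

print("\n".join(step_diff))
```

With the start point moved to after the whitespace filter, the diff only contains the line that the second filter removed, instead of every line in the dataset.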