hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

[Discussion] Filter that checks for numerical sequences?

XapaJIaMnu opened this issue · comments

Do we want a filter that checks for the presence of numerical sequences on both sides?
Looking through CCAligned, there are some places where numbers are present on one side but absent on the other, which suggests that the two sentences are not parallel. There are also cases where the numbers legitimately differ between the two sides (currency conversions, imperial-metric system shenanigans, etc.).

Do we have a rule for that somewhere in bicleaner? Has anyone experimented with that @jelmervdl @ZJaume ?

Bicleaner Hardrules already has that rule (disabled by default); we could use it.

commented

I also have a stand-alone filter that does the same thing with some wiggle room.
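
For readers who want a concrete picture of the idea, here is a minimal sketch of a number-mismatch check with some tolerance. It is not the Bicleaner Hardrules rule and not the stand-alone filter mentioned above. The names `number_mismatch_ratio`, `keep_pair` and the `max_mismatch` threshold are hypothetical, and treating every run of ASCII digits as a "number" is a simplifying assumption.

```python
import re
from collections import Counter

# Assumption: a "number" is any run of ASCII digits. Separators such as
# "," or "." split one number into several runs on both sides alike.
NUMBER_RE = re.compile(r"[0-9]+")


def number_mismatch_ratio(src: str, trg: str) -> float:
    """Return the fraction of digit runs that occur on only one side.

    0.0 means every digit run is matched (or there are none at all),
    1.0 means no digit run on either side has a counterpart.
    """
    src_nums = Counter(NUMBER_RE.findall(src))
    trg_nums = Counter(NUMBER_RE.findall(trg))
    total = sum(src_nums.values()) + sum(trg_nums.values())
    if total == 0:
        return 0.0
    # Counter subtraction keeps only positive counts, so this is the
    # multiset symmetric difference of the two sides.
    unmatched = (src_nums - trg_nums) + (trg_nums - src_nums)
    return sum(unmatched.values()) / total


def keep_pair(src: str, trg: str, max_mismatch: float = 0.5) -> bool:
    """Keep the pair unless too many digit runs are unmatched.

    max_mismatch is the wiggle room (hypothetical default): it lets
    pairs through where some numbers differ, e.g. unit conversions.
    """
    return number_mismatch_ratio(src, trg) <= max_mismatch
```

With the default threshold, a pair like "5 miles in 2 hours" / "8 km in 2 Stunden" is kept because half of its digit runs still match, while a pair with a number on only one side is dropped. The right threshold would depend on the corpus and language pair.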

I also have an attempt at one that fixes the mismatch, so we don't have to throw the pair away, but that was a bit of a failure. Too hard to get right.

OK, so we already have it and I just didn't find it, so I think we can close this.