hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

Refactor filters as transformers & scorers

jelmervdl opened this issue

In hindsight, OpusFilter had the right idea here.

OpusCleaner right now has filters, which take lines on their stdin and produce lines on their stdout. This model is really simple, and matches what you normally do in bash scripts. But it also means you can't assume (or validate) anything about the output.

In practice, filters can do a combination of three things:

  1. Transform lines: the number of lines in input and output is the same, but the content of each individual line might have been altered. Think transl(iter)ation, or fixing orthographic errors.
  2. Remove lines: short lines, whitespace, low scoring lines, etc.
  3. Add lines: uncommon, but think of a sentence splitter. bifixer can do this for example.

Filters that do 1 can be executed in a parallel streaming fashion, so that's nice. You can also use tools like col.py to run them on only a single column (so you don't have to do column parsing in your filter 🎉)
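To make the single-column idea concrete, here is a minimal sketch of what such a wrapper does. It is not the actual col.py interface: `apply_to_column` is a hypothetical helper, and it assumes tab-separated columns with one sentence pair per line.

```python
from typing import Callable, Iterable, Iterator


def apply_to_column(
    lines: Iterable[str],
    column: int,
    transform: Callable[[str], str],
) -> Iterator[str]:
    """Apply `transform` to one tab-separated column of every line.

    Hypothetical helper for illustration only. One line in gives exactly
    one line out, which is the type 1 guarantee.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        fields[column] = transform(fields[column])
        yield "\t".join(fields)


if __name__ == "__main__":
    pairs = ["Hello World\tHallo Wereld", "Good Morning\tGoede Morgen"]
    # Lowercase only the second column; the first is left untouched.
    for out in apply_to_column(pairs, 1, str.lower):
        print(out)
```

Because each output line depends only on its own input line, the input could be split into chunks and processed in parallel without changing the result.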

Filters that do 2 can be rewritten as type 1: instead of changing the output, they score each line in the input. Thresholding can then be done using threshold.py, which can keep a score cache, and the frontend could present a histogram to make it easier to pick a threshold. For filters that just remove empty lines, the score could be a simple 0 for empty, 1 for non-empty. For language filters you can use prediction scores. Bicleaner also works with scores like this.
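A rough sketch of the scorer plus threshold split, assuming tab-separated columns. The names `score_non_empty`, `score_lines` and `apply_threshold` are hypothetical and do not reflect how threshold.py is implemented; the point is only that scores are computed once and the threshold is applied afterwards.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def score_non_empty(line: str) -> float:
    """Scorer: 1.0 if every column has content, 0.0 otherwise."""
    fields = line.rstrip("\n").split("\t")
    return 1.0 if all(field.strip() for field in fields) else 0.0


def score_lines(lines: Iterable[str], scorer: Callable[[str], float]) -> List[float]:
    """Run the scorer once per line; the result can be cached and reused."""
    return [scorer(line) for line in lines]


def apply_threshold(
    lines: Sequence[str], scores: Sequence[float], threshold: float
) -> Iterator[str]:
    """Keep the lines whose cached score is at least `threshold`."""
    if len(lines) != len(scores):
        raise ValueError("expected exactly one score per line")
    for line, score in zip(lines, scores):
        if score >= threshold:
            yield line


if __name__ == "__main__":
    data = ["hello\thallo", "\thallo", "bye\tdoei"]
    cached = score_lines(data, score_non_empty)
    # Trying another threshold only needs `cached`, not a re-run of the scorer.
    print(list(apply_threshold(data, cached, 0.5)))
```

The cached list of scores is also exactly what a frontend would need to draw a histogram.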

Additionally, if the frontend knows a filter is a type 1 or type 2, it can make better choices in how to present the diff.
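As an illustration of the kind of diff this enables: for a type 1 filter, input and output can be paired by position, so no general-purpose diff algorithm is needed. `changed_pairs` below is a hypothetical helper, not part of OpusCleaner.

```python
from typing import Iterator, Sequence, Tuple


def changed_pairs(
    before: Sequence[str], after: Sequence[str]
) -> Iterator[Tuple[int, str, str]]:
    """Yield (index, old, new) for each line a type 1 filter altered.

    Relies on the type 1 guarantee that line counts match, and raises
    if the filter broke that guarantee.
    """
    if len(before) != len(after):
        raise ValueError("a type 1 filter must not add or remove lines")
    for index, (old, new) in enumerate(zip(before, after)):
        if old != new:
            yield index, old, new


if __name__ == "__main__":
    original = ["Hello  world", "Fine as is"]
    filtered = ["Hello world", "Fine as is"]
    for index, old, new in changed_pairs(original, filtered):
        print(f"{index}: {old!r} -> {new!r}")
```

For a type 2 filter the equivalent view would be the list of lines whose score falls below the chosen threshold.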

Filters that do 3 are a bit of a pain, but also uncommon. I think we can keep a sort of legacy support for these types of filters.

Type 3 can also serve as the fallback for filters that do too much (i.e. both transform and filter). And we can discourage these filters by not giving them any of the fancy interface or performance benefits.