hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

How does the num_mismatch filter tokenize?

kpu opened this issue · comments

The num_mismatch filter threw out this sentence pair:
"Procès-verbal of rectification to the Convention on jurisdiction and the recognition and enforcement of judgments in civil and commercial matters, signed at Lugano on 30 October 2007"
"Procès–verbal ta’ Rettifika tal-Konvenzjoni dwar il-ġurisdizzjoni u r-rikonoxximent u l-eżekuzzjoni ta’ sentenzi f’materji ċivili u kummerċjali, iffirmat f’Lugano fit-30 ta’ Ottubru 2007"

It kept this sentence pair:
"(Official Journal of the European Union L 147 of 10 June 2009)"
"(Il-Ġurnal Uffiċjali tal-Unjoni Ewropea L 147 tal-10 ta’ Ġunju 2009)"

Unsure why "tal-10" was ok to match 10 but "fit-30" was not ok to match 30.

commented

Heh, it doesn't. It just searches for numbers:

nums_left, nums_right = (set(map(normalize, re.finditer(r'(?P<sign>[-+]?)(?:0*)(?P<value>\d+(?:[\.,]\d+)*)', col))) for col in cols[:2])

The first pair isn't accepted because 2007 matches, but 30 and -30 don't (the dash in "fit-30" is read as a minus sign): 1 / 2 < 1.0

The second pair has matching: 147, 2009 vs. non-matching: 10, -10 (from "tal-10"): 2 / 2 >= 1.0, so it is accepted.
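
To make the two cases above reproducible, here is a rough sketch using the expression quoted earlier. The `normalize` helper is a hypothetical stand-in (the real one is not shown in this thread), and the "matching / non-matching" count simply follows the arithmetic described above; the filter's actual formula may differ.

```python
import re

# The expression quoted in the comment above.
NUMBER_RE = re.compile(r'(?P<sign>[-+]?)(?:0*)(?P<value>\d+(?:[\.,]\d+)*)')


def normalize(match):
    """Hypothetical stand-in for the filter's normalize(): sign plus digits."""
    sign = '-' if match['sign'] == '-' else ''
    return sign + match['value']


def compare(left, right):
    """Return (numbers found on both sides, numbers found on one side only)."""
    nums_left, nums_right = (
        set(map(normalize, NUMBER_RE.finditer(col))) for col in (left, right)
    )
    return nums_left & nums_right, nums_left ^ nums_right


# Tail of the rejected pair: "fit-30" is read as -30.
rejected = compare(
    "signed at Lugano on 30 October 2007",
    "iffirmat f’Lugano fit-30 ta’ Ottubru 2007",
)
# The kept pair: "tal-10" is read as -10, but two other numbers match.
kept = compare(
    "(Official Journal of the European Union L 147 of 10 June 2009)",
    "(Il-Ġurnal Uffiċjali tal-Unjoni Ewropea L 147 tal-10 ta’ Ġunju 2009)",
)

for name, (matching, mismatching) in (("rejected", rejected), ("kept", kept)):
    print(name, sorted(matching), sorted(mismatching),
          f"{len(matching)} / {len(mismatching)}")
# rejected ['2007'] ['-30', '30'] 1 / 2
# kept ['147', '2009'] ['-10', '10'] 2 / 2
```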

I'm a bit unhappy about how the ratio makes the filter less predictable than a simple yes/no would be, but I didn't want to throw out sentences with a bunch of good number examples just because they got a single one wrong.

Since dashes are used both as signs and as punctuation, I'm tempted to get rid of the "sign has to match" rule that I recently added.

Proposing to replace the regexp with (?P<sign>(?<=\s)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b:

  • matches 30 in fit-30
  • but matches -30 in "it is -30 degrees"
  • does not match 40something
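
The three bullets above can be checked with a quick sketch. The `find_numbers` helper is only an illustrative name, not part of OpusCleaner.

```python
import re

# The proposed expression: a sign only counts when preceded by whitespace,
# and the number must end at a word boundary.
PROPOSED_RE = re.compile(
    r'(?P<sign>(?<=\s)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b'
)


def find_numbers(text):
    """Return (sign, value) pairs; sign is '' when no sign was captured."""
    return [(m['sign'] or '', m['value']) for m in PROPOSED_RE.finditer(text)]


print(find_numbers("fit-30"))             # [('', '30')]
print(find_numbers("it is -30 degrees"))  # [('-', '30')]
print(find_numbers("40something"))        # []
```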

For context: the current expression is (?P<sign>[-+]?)(?:0*)(?P<value>\d+(?:[\.,]\d+)*)

Edit: oh, it doesn't match -30 at the beginning of a sentence. Need to fix that.
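
One possible way to handle the start-of-sentence case, offered as an untested-in-production sketch rather than the fix that was actually adopted: swap the positive lookbehind `(?<=\s)` for the negative lookbehind `(?<!\S)`, which succeeds both after whitespace and at the very start of the string. The names below are illustrative only.

```python
import re

# Proposed expression from above: the sign needs a preceding whitespace character.
PROPOSED_RE = re.compile(
    r'(?P<sign>(?<=\s)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b'
)
# Variant (an assumption, not from the thread): the sign must NOT be preceded
# by a non-whitespace character, which also holds at the start of the string.
VARIANT_RE = re.compile(
    r'(?P<sign>(?<!\S)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b'
)


def find_numbers(pattern, text):
    """Return (sign, value) pairs; sign is '' when no sign was captured."""
    return [(m['sign'] or '', m['value']) for m in pattern.finditer(text)]


print(find_numbers(PROPOSED_RE, "-30 degrees"))       # [('', '30')]  sign is lost
print(find_numbers(VARIANT_RE, "-30 degrees"))        # [('-', '30')]
print(find_numbers(VARIANT_RE, "fit-30"))             # [('', '30')]
print(find_numbers(VARIANT_RE, "it is -30 degrees"))  # [('-', '30')]
```

Note that a sign directly after other punctuation, such as an opening parenthesis, would still be dropped by this variant.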