hplt-project / OpusCleaner

The num_mismatch filter threw out this sentence pair:
"Procès-verbal of rectification to the Convention on jurisdiction and the recognition and enforcement of judgments in civil and commercial matters, signed at Lugano on 30 October 2007"
"Procès–verbal ta’ Rettifika tal-Konvenzjoni dwar il-ġurisdizzjoni u r-rikonoxximent u l-eżekuzzjoni ta’ sentenzi f’materji ċivili u kummerċjali, iffirmat f’Lugano fit-30 ta’ Ottubru 2007"

It kept this sentence pair:
"(Official Journal of the European Union L 147 of 10 June 2009)"
"(Il-Ġurnal Uffiċjali tal-Unjoni Ewropea L 147 tal-10 ta’ Ġunju 2009)"

Unsure why "tal-10" was ok to match 10 but "fit-30" was not ok to match 30.

Heh, it doesn't. It just searches for numbers:

OpusCleaner/opuscleaner/filters/num_mismatch.py

Line 18 in afd9bc7

    
           nums_left, nums_right = (set(map(normalize, re.finditer(r'(?P<sign>[-+]?)(?:0*)(?P<value>\d+(?:[\.,]\d+)*)', col))) for col in cols[:2])

The first pair isn't accepted because only 2007 matches, while 30 and -30 don't: 1 / 2 < 1.0

In the second pair, 147 and 2009 match, while 10 and -10 don't: 2 / 2 >= 1.0, so it is accepted.
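As a rough sketch of that arithmetic: the snippet below applies the quoted expression to shortened versions of both pairs and counts matching versus non-matching numbers. The `normalize` stand-in and the overlap/difference set logic are assumptions inferred from the figures in this comment, not copied from the filter.

```python
import re

# The expression quoted above from num_mismatch.py.
CURRENT = re.compile(r'(?P<sign>[-+]?)(?:0*)(?P<value>\d+(?:[\.,]\d+)*)')

def normalize(match):
    # Assumption: stand-in for the filter's normalize(); keeps sign plus value.
    return match.group('sign') + match.group('value')

def number_counts(left, right):
    # Hypothetical helper: (matching, non-matching) number counts for one pair.
    nums_left, nums_right = (set(map(normalize, CURRENT.finditer(col)))
                             for col in (left, right))
    return len(nums_left & nums_right), len(nums_left ^ nums_right)

# Shortened versions of the two sentence pairs from this issue.
rejected = number_counts('signed at Lugano on 30 October 2007',
                         'iffirmat f’Lugano fit-30 ta’ Ottubru 2007')
kept = number_counts('(Official Journal of the European Union L 147 of 10 June 2009)',
                     '(Il-Ġurnal Uffiċjali tal-Unjoni Ewropea L 147 tal-10 ta’ Ġunju 2009)')
print(rejected, kept)
```

In both pairs the hyphen glued to the Maltese article is read as a minus sign, so the day number never matches. The second pair only survives because it has one more matching number.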

I'm a bit unhappy about how the ratio makes the filter less predictable than a simple yes/no would be, but I didn't want to throw out sentences with a bunch of good number examples just because it got a single one wrong.

Since dashes are used both as signs and as punctuation, I'm tempted to get rid of the "sign has to match" rule that I recently added.

Proposing to replace the regexp with (?P<sign>(?<=\s)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b:

matches 30 in "fit-30"
but matches -30 in "it is -30 degrees"
does not match "40something"
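The three cases above can be checked with a short script. This is only a sketch: `find_numbers` is a hypothetical helper, not part of OpusCleaner, and it reports an uncaptured sign as an empty string.

```python
import re

# Proposed expression from this comment: a sign only counts when it is
# preceded by whitespace, and the number must end at a word boundary.
PROPOSED = re.compile(r'(?P<sign>(?<=\s)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b')

def find_numbers(text):
    # Hypothetical helper: list of (sign, value); sign is '' when not captured.
    return [(m.group('sign') or '', m.group('value'))
            for m in PROPOSED.finditer(text)]

print(find_numbers('fit-30'))             # hyphen treated as punctuation
print(find_numbers('it is -30 degrees'))  # hyphen treated as a minus sign
print(find_numbers('40something'))        # no word boundary after the digits
```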

For context: the current expression is (?P<sign>[-+]?)(?:0*)(?P<value>\d+(?:[\.,]\d+)*)

Edit: oh, it doesn't match -30 at the beginning of a sentence. Need to fix that.
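One possible way to cover the start-of-sentence case, offered as an assumption rather than the fix that was actually adopted, is to swap the lookbehind `(?<=\s)` for `(?<!\S)`. That reads as "not preceded by a non-space character", which also holds at the very start of the string. `signed_numbers` is a hypothetical helper.

```python
import re

# Assumption: (?<!\S) in place of (?<=\s) is one candidate fix, not the
# change that ended up in the filter.
CANDIDATE = re.compile(r'(?P<sign>(?<!\S)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b')

def signed_numbers(text):
    # Hypothetical helper: each number as a string, with its sign if captured.
    return [(m.group('sign') or '') + m.group('value')
            for m in CANDIDATE.finditer(text)]

print(signed_numbers('-30 degrees'))        # sign kept at start of string
print(signed_numbers('it is -30 degrees'))  # sign kept after whitespace
print(signed_numbers('fit-30'))             # hyphen after a letter is dropped
```

A sign directly after an opening bracket, as in "(-30)", would still be dropped by this variant, so it trades one edge case for another.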

How does the num_mismatch filter tokenize?