How does the num_mismatch filter tokenize?
kpu opened this issue
The num_mismatch filter threw out this sentence pair:
"Procès-verbal of rectification to the Convention on jurisdiction and the recognition and enforcement of judgments in civil and commercial matters, signed at Lugano on 30 October 2007"
"Procès–verbal ta’ Rettifika tal-Konvenzjoni dwar il-ġurisdizzjoni u r-rikonoxximent u l-eżekuzzjoni ta’ sentenzi f’materji ċivili u kummerċjali, iffirmat f’Lugano fit-30 ta’ Ottubru 2007"
It kept this sentence pair:
"(Official Journal of the European Union L 147 of 10 June 2009)"
"(Il-Ġurnal Uffiċjali tal-Unjoni Ewropea L 147 tal-10 ta’ Ġunju 2009)"
Unsure why "tal-10" was ok to match 10 but "fit-30" was not ok to match 30.
Heh, it doesn't. It just searches for numbers:
The first one isn't accepted because 2007 matches, but 30 and -30 don't: 1 / 2 < 1.0.
The second has matching 147, 2009 vs. non-matching 10, -10: 2 / 2 >= 1.0, so it is accepted.
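To make the arithmetic above concrete, here is a minimal sketch that reproduces both ratios with the current expression. The names `extract_numbers`, `number_ratio` and `keep` are hypothetical, and dividing the matching count by the non-matching count is inferred from the figures quoted in this thread, not copied from the filter's source.

```python
import re

# The current expression, as quoted in this thread.
CURRENT_RE = re.compile(r"(?P<sign>[-+]?)(?:0*)(?P<value>\d+(?:[\.,]\d+)*)")


def extract_numbers(text):
    """Return the set of signed number strings found in text (hypothetical helper)."""
    return {m.group("sign") + m.group("value") for m in CURRENT_RE.finditer(text)}


def number_ratio(src, trg):
    """Matching numbers divided by non-matching numbers (assumed definition)."""
    left, right = extract_numbers(src), extract_numbers(trg)
    matching = left & right       # numbers found on both sides
    non_matching = left ^ right   # numbers found on one side only
    if not non_matching:
        # Assumption: a pair without any mismatch always passes.
        return float("inf")
    return len(matching) / len(non_matching)


def keep(src, trg, threshold=1.0):
    """Keep the pair when the ratio reaches the threshold."""
    return number_ratio(src, trg) >= threshold


rejected = ("signed at Lugano on 30 October 2007",
            "iffirmat f’Lugano fit-30 ta’ Ottubru 2007")
accepted = ("(Official Journal of the European Union L 147 of 10 June 2009)",
            "(Il-Ġurnal Uffiċjali tal-Unjoni Ewropea L 147 tal-10 ta’ Ġunju 2009)")

print(number_ratio(*rejected), keep(*rejected))   # 0.5 False
print(number_ratio(*accepted), keep(*accepted))   # 1.0 True
```

In both pairs the hyphen is read as a minus sign, so each produces two mismatches (30 vs. -30, 10 vs. -10). The second pair survives only because it has one more matching number.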
I'm a bit unhappy about how the ratio makes it less predictable than a simple yes/no would be, but I didn't want to throw out sentences with a bunch of good number examples just because it got a single one wrong.
Since dashes are used both as signs and as punctuation, I'm tempted to get rid of the "sign has to match" rule that I recently added.
Proposing to replace the regexp with (?P<sign>(?<=\s)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b:
- matches 30 in "fit-30"
- but -30 in "it is -30 degrees"
- does not match "40something"
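The three cases listed above can be checked directly. This is a small sketch using Python's `re` module; `find_numbers` is a hypothetical helper, and an optional sign group that did not take part in the match is reported as an empty string.

```python
import re

# The proposed expression: a sign only counts when preceded by whitespace,
# and the number must end at a word boundary.
PROPOSED_RE = re.compile(
    r"(?P<sign>(?<=\s)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b"
)


def find_numbers(text):
    """Return (sign, value) pairs found in text (hypothetical helper)."""
    return [
        (m.group("sign") or "", m.group("value"))
        for m in PROPOSED_RE.finditer(text)
    ]


print(find_numbers("fit-30"))             # [('', '30')]
print(find_numbers("it is -30 degrees"))  # [('-', '30')]
print(find_numbers("40something"))        # []
```

The trailing `\b` is what rejects "40something": there is no word boundary between a digit and a letter, so no prefix of "40" can end the match.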
For context: the current expression is (?P<sign>[-+]?)(?:0*)(?P<value>\d+(?:[\.,]\d+)*)
Edit: oh, it doesn't match -30 at the beginning of a sentence. Need to fix that.
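One possible way to cover the start-of-sentence case, offered only as a sketch and not as the fix that was eventually adopted: replace the positive lookbehind `(?<=\s)` with the negative lookbehind `(?<!\S)`, which also succeeds when there is no preceding character at all. `VARIANT_RE` and `signed_numbers` are hypothetical names.

```python
import re

# The proposed expression from this thread.
PROPOSED_RE = re.compile(
    r"(?P<sign>(?<=\s)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b"
)

# Hypothetical variant: "not preceded by a non-whitespace character"
# is true after whitespace and also at the very start of the string.
VARIANT_RE = re.compile(
    r"(?P<sign>(?<!\S)[-+])?(?:0*)(?P<value>\d+(?:[\.,]\d+)*)\b"
)


def signed_numbers(pattern, text):
    """Return the numbers in text as strings, with their sign if one was captured."""
    return [
        (m.group("sign") or "") + m.group("value")
        for m in pattern.finditer(text)
    ]


print(signed_numbers(PROPOSED_RE, "-30 degrees"))       # ['30']
print(signed_numbers(VARIANT_RE, "-30 degrees"))        # ['-30']
print(signed_numbers(VARIANT_RE, "fit-30"))             # ['30']
print(signed_numbers(VARIANT_RE, "it is -30 degrees"))  # ['-30']
```

Python's `re` requires fixed-width lookbehinds, so an alternation such as `(?<=\s|^)` is rejected at compile time; the negative form avoids that restriction.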