Weird score using token set ratio with cdist

Question

arebouillet opened this issue a year ago

Before starting, I would like to let you know that your package is awesome. I used to be a fuzzywuzzy user, so I can definitely see what your implementation brings in terms of performance!
However, I am getting unexpected results from the cdist operation.
See below; I think the code speaks for itself:

>>> from rapidfuzz import fuzz
>>> from rapidfuzz.process import cdist
>>> adr1
'188 RUE DU FAUBOURG SAINT ANTOINE'
>>> adr2
'188 RUE DU FAUBOURG SAINT-ANTOINE'
>>> fuzz.token_set_ratio(adr1, adr2)
100.0
>>> cdist([adr1], [adr2], scorer=fuzz.token_set_ratio)
array([[81.818184]], dtype=float32)

Am I doing something wrong? I am using version 2.15.1 in a Poetry venv with Python 3.8.10.

Max Bachmann · Answer 1 · Tue Apr 18 2023 00:03:29 GMT+0800 (China Standard Time)

The difference lies in the preprocessing. Before v3.0.0 the signatures were:

fuzz.token_set_ratio(..., processor=utils.default_process)
process.cdist(..., processor=None)

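cdist always called the scorer it was given with processor=None and applied only the processor passed to cdist itself. Since cdist defaulted to processor=None, the strings were not preprocessed at all in your call, while the direct fuzz.token_set_ratio call preprocessed them with utils.default_process. Because this can be surprising, it was changed in v3.0.0: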

every function now defaults to processor=None
the process module no longer calls the scorer with processor=None

So in v3.0.0 you will get:

>>> from rapidfuzz import fuzz, utils
>>> from rapidfuzz.process import cdist
>>> adr1
'188 RUE DU FAUBOURG SAINT ANTOINE'
>>> adr2
'188 RUE DU FAUBOURG SAINT-ANTOINE'
>>> fuzz.token_set_ratio(adr1, adr2)
81.818184
>>> cdist([adr1], [adr2], scorer=fuzz.token_set_ratio)
array([[81.818184]], dtype=float32)
>>> fuzz.token_set_ratio(adr1, adr2, processor=utils.default_process)
100.0
>>> cdist([adr1], [adr2], scorer=fuzz.token_set_ratio, processor=utils.default_process)
array([[100.0]], dtype=float32)
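
To see why preprocessing changes the score for these two addresses, here is a rough standard-library sketch. The helper names simple_process and token_set are hypothetical and are not part of rapidfuzz; simple_process only approximates what utils.default_process does (lowercase, replace non-alphanumeric characters with spaces, trim), and it handles ASCII only. It does not compute the actual ratio; it only shows that the token sets differ on the raw strings and become identical once the hyphen is replaced.

```python
import re


def simple_process(text: str) -> str:
    # Hypothetical stand-in for utils.default_process (ASCII only):
    # lowercase, turn runs of non-alphanumeric characters into a
    # single space, and trim both ends.
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()


def token_set(text: str) -> frozenset:
    # Token-based scorers split on whitespace and compare the tokens
    # as sets; this helper only builds that set.
    return frozenset(text.split())


adr1 = "188 RUE DU FAUBOURG SAINT ANTOINE"
adr2 = "188 RUE DU FAUBOURG SAINT-ANTOINE"

# Raw strings: 'SAINT-ANTOINE' is one token, so the sets differ.
raw_equal = token_set(adr1) == token_set(adr2)

# Processed strings: the hyphen becomes a space, so the sets match.
processed_equal = token_set(simple_process(adr1)) == token_set(simple_process(adr2))

print(raw_equal)        # False
print(processed_equal)  # True
```

With identical token sets the scorer has nothing left to distinguish, which is consistent with the 100.0 result when utils.default_process is applied and the lower score when it is not.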

arebouillet · Answer 2 · Tue Apr 18 2023 00:09:57 GMT+0800 (China Standard Time)

Pretty clear, thank you very much.