Weird score using token set ratio with cdist
arebouillet opened this issue
Before starting, I would like to let you know that your package is awesome. I used to be a fuzzywuzzy user, so I definitely see what your implementation brings in terms of performance!
However, I am getting unexpected results from the cdist operation.
See below, I think the code speaks for itself:
>>> adr1
'188 RUE DU FAUBOURG SAINT ANTOINE'
>>> adr2
'188 RUE DU FAUBOURG SAINT-ANTOINE'
>>> fuzz.token_set_ratio(adr1,adr2)
100.0
>>> cdist([adr1],[adr2],scorer = fuzz.token_set_ratio)
array([[81.818184]], dtype=float32)
Am I doing something wrong? I am using version 2.15.1 in a poetry venv with Python 3.8.10.
The difference lies in the preprocessing. Until v3.0.0 the signatures are:
fuzz.token_set_ratio(..., processor=utils.default_process)
process.cdist(..., processor=None)
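Before v3.0.0, cdist always called the passed scorer with processor=None and only applied the processor passed to cdist itself. Since cdist defaults to processor=None, your strings were compared without any preprocessing, so the hyphen in 'SAINT-ANTOINE' lowers the score. Because this can be surprising, the behavior was changed in v3.0.0: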
- every function now defaults to processor=None
- the process module no longer calls the scorer with processor=None
So in v3.0.0 you will get:
>>> from rapidfuzz import fuzz, utils
>>> from rapidfuzz.process import cdist
>>> adr1
'188 RUE DU FAUBOURG SAINT ANTOINE'
>>> adr2
'188 RUE DU FAUBOURG SAINT-ANTOINE'
>>> fuzz.token_set_ratio(adr1,adr2)
81.818184
>>> cdist([adr1],[adr2],scorer = fuzz.token_set_ratio)
array([[81.818184]], dtype=float32)
>>> fuzz.token_set_ratio(adr1,adr2, processor=utils.default_process)
100.0
>>> cdist([adr1],[adr2],scorer = fuzz.token_set_ratio, processor=utils.default_process)
array([[100.0]], dtype=float32)
Pretty clear, thank you very much