rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Weird score using token set ratio with cdist

arebouillet opened this issue

Before starting, I would like to let you know that your package is awesome. I used to be a fuzzywuzzy user, so I definitely see what your implementation brings in terms of performance!
However, I am facing an issue: I get unexpected results when using the cdist operation.
See below, I think the code speaks for itself:

>>> from rapidfuzz import fuzz
>>> from rapidfuzz.process import cdist
>>> adr1
'188 RUE DU FAUBOURG SAINT ANTOINE'
>>> adr2
'188 RUE DU FAUBOURG SAINT-ANTOINE'
>>> fuzz.token_set_ratio(adr1,adr2)
100.0
>>> cdist([adr1],[adr2],scorer = fuzz.token_set_ratio)
array([[81.818184]], dtype=float32)

Am I doing something wrong? I am using version 2.15.1 in a Poetry venv with Python 3.8.10.

The difference lies in the preprocessing. In versions before v3.0.0 the default signatures are:

fuzz.token_set_ratio(..., processor=utils.default_process)
process.cdist(..., processor=None)

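In those versions cdist always calls the passed scorer with processor=None and applies only the processor passed to cdist itself. That one defaults to None, so in your cdist call the strings were compared without any preprocessing, while the direct fuzz.token_set_ratio call preprocessed them first. Since this can be surprising, it was changed in v3.0.0: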

  • every function now defaults to processor=None
  • the process module no longer calls the scorer with processor=None
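To see why the preprocessing step matters for these two addresses, here is a small standard-library sketch. `simple_process` is a hypothetical, simplified stand-in for `utils.default_process` (lowercase, replace non-alphanumeric characters with whitespace, strip). It is not the library's actual implementation, only an approximation of its documented behaviour.

```python
def simple_process(text: str) -> str:
    """Hypothetical stand-in for utils.default_process (simplified assumption)."""
    # Replace every non-alphanumeric character with a space,
    # then trim the ends and lowercase the result.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return cleaned.strip().lower()


adr1 = "188 RUE DU FAUBOURG SAINT ANTOINE"
adr2 = "188 RUE DU FAUBOURG SAINT-ANTOINE"

# Without preprocessing, "SAINT-ANTOINE" is a single token.
raw_tokens_1 = set(adr1.split())
raw_tokens_2 = set(adr2.split())

# With preprocessing, the hyphen becomes a space and the token sets coincide.
clean_tokens_1 = set(simple_process(adr1).split())
clean_tokens_2 = set(simple_process(adr2).split())

print(raw_tokens_1 == raw_tokens_2)      # False
print(clean_tokens_1 == clean_tokens_2)  # True
```

A token-set comparison only sees identical inputs once the hyphen has been normalized, which is why the score depends on whether a processor ran.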

So in v3.0.0 you will get:

>>> from rapidfuzz import fuzz, utils
>>> from rapidfuzz.process import cdist
>>> adr1
'188 RUE DU FAUBOURG SAINT ANTOINE'
>>> adr2
'188 RUE DU FAUBOURG SAINT-ANTOINE'
>>> fuzz.token_set_ratio(adr1,adr2)
81.818184
>>> cdist([adr1],[adr2],scorer = fuzz.token_set_ratio)
array([[81.818184]], dtype=float32)
>>> fuzz.token_set_ratio(adr1,adr2, processor=utils.default_process)
100.0
>>> cdist([adr1],[adr2],scorer = fuzz.token_set_ratio, processor=utils.default_process)
array([[100.0]], dtype=float32)
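For anyone wondering where the 81.82 comes from, the following is a rough pure-Python sketch of a token set ratio built on a longest-common-subsequence similarity. The names `lcs_length`, `indel_ratio`, `sketch_token_set_ratio` and `crude_process` are made up for this illustration, and the logic is a simplified reimplementation under my own assumptions, not RapidFuzz's code. It does reproduce both scores for these two addresses.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100] when only insertions and deletions are allowed."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


def sketch_token_set_ratio(a: str, b: str) -> float:
    """Simplified token set ratio (illustrative assumption, not the library code)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    # If one token set contains the other, treat it as a perfect match.
    if common and (not only_a or not only_b):
        return 100.0
    joined_a = f"{common} {only_a}".strip()
    joined_b = f"{common} {only_b}".strip()
    scores = [indel_ratio(joined_a, joined_b)]
    if common:
        scores.append(indel_ratio(common, joined_a))
        scores.append(indel_ratio(common, joined_b))
    return max(scores)


def crude_process(text: str) -> str:
    """Crude stand-in for a processor: lowercase and turn hyphens into spaces."""
    return text.lower().replace("-", " ")


adr1 = "188 RUE DU FAUBOURG SAINT ANTOINE"
adr2 = "188 RUE DU FAUBOURG SAINT-ANTOINE"

raw_score = sketch_token_set_ratio(adr1, adr2)
processed_score = sketch_token_set_ratio(crude_process(adr1), crude_process(adr2))

print(round(raw_score, 2))  # 81.82
print(processed_score)      # 100.0
```

Without preprocessing, the shared tokens plus "ANTOINE SAINT" are compared against the shared tokens plus "SAINT-ANTOINE": the two 33-character strings have a longest common subsequence of 27 characters, giving 100 * 2 * 27 / 66, roughly 81.82. The 81.818184 in the cdist output is that same value stored as float32.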

Pretty clear, thank you very much