rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Inconsistent results for token_ratio between 2.15.1 and 3.0.0

alonshalita opened this issue

Hi,

token_ratio returns different results after migrating from 2.15.1 to 3.0.0 (or a later release). For example:

Python 3.11.2 (main, Mar 24 2023, 00:28:48) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rapidfuzz
>>> rapidfuzz.__version__
'2.15.1'
>>> rapidfuzz.fuzz.token_ratio(
...   "did lincoln. sin the national, banking act of 1863?",
...   "Did Lincoln sign the National Banking Act of 1863?")
98.96907216494846

and

Python 3.11.2 (main, Mar 24 2023, 00:28:48) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rapidfuzz
>>> rapidfuzz.__version__
'3.0.0'
>>> rapidfuzz.fuzz.token_ratio(
...   "did lincoln. sin the national, banking act of 1863?",
...   "Did Lincoln sign the National Banking Act of 1863?")
87.12871287128714

Is this a bug, or an expected change? I couldn't seem to find anything related in the changelog.

In version 3.0.0, all scorers use processor=None by default, which makes the defaults consistent. Previously, some of them used processor=utils.default_process. This changes the results in your case because the strings are no longer preprocessed. You can re-enable the preprocessing manually:

>>> import rapidfuzz
>>> rapidfuzz.__version__
'3.1.0'
>>> rapidfuzz.fuzz.token_ratio(
...     "did lincoln. sin the national, banking act of 1863?",
...     "Did Lincoln sign the National Banking Act of 1863?",
...     processor=rapidfuzz.utils.default_process)
98.96907216494846

In the changelog this is mentioned as:

update defaults of the processor argument to be None everywhere. This changes the defaults of some of
the functions in rapidfuzz.fuzz and rapidfuzz.process.
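
To see why preprocessing moves the score so much for these two strings, here is a minimal standard-library sketch. The helper `approx_default_process` is hypothetical: it only approximates the documented behaviour of `rapidfuzz.utils.default_process` (lowercase the text, replace non-alphanumeric characters with whitespace, trim the ends) and is not the library's implementation. It counts how many whitespace-separated tokens the two strings share before and after that normalisation.

```python
import re


def approx_default_process(text: str) -> str:
    """Hypothetical stand-in that approximates rapidfuzz.utils.default_process."""
    # Replace every non-word character (punctuation etc.) with a space,
    # then trim the ends and lowercase the result.
    return re.sub(r"\W", " ", text).strip().lower()


s1 = "did lincoln. sin the national, banking act of 1863?"
s2 = "Did Lincoln sign the National Banking Act of 1863?"

# Tokens shared by the raw strings: case and punctuation make most tokens differ.
raw_shared = set(s1.split()) & set(s2.split())

# Tokens shared after normalisation: only "sin" vs "sign" still differs.
p1 = approx_default_process(s1)
p2 = approx_default_process(s2)
processed_shared = set(p1.split()) & set(p2.split())

print(len(raw_shared), len(processed_shared))  # prints "3 8"
```

With processor=None the token-based scorers compare the raw tokens, so "Lincoln" and "lincoln." count as different words, which is consistent with the lower score reported for 3.0.0.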

Thanks for the clarification. Can you tell me which functions had their default processor changed?

I updated the changelog to mention that this might change the results, how to get back the old behaviour, and which functions are affected: https://github.com/maxbachmann/RapidFuzz/releases/tag/v3.0.0.

Affected functions are:

  • process.extract, process.extract_iter, process.extractOne
  • fuzz.token_sort_ratio, fuzz.token_set_ratio, fuzz.token_ratio, fuzz.partial_token_sort_ratio, fuzz.partial_token_set_ratio, fuzz.partial_token_ratio, fuzz.WRatio, fuzz.QRatio
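
If many call sites rely on the old defaults, one option is to bind the processor once instead of passing it at every call. The sketch below is only an illustration: `with_processor` and `toy_ratio` are hypothetical names, and `toy_ratio` is a standard-library stand-in for a scorer that accepts a `processor` keyword, so the example runs without RapidFuzz installed.

```python
from difflib import SequenceMatcher
from functools import partial


def with_processor(scorer, processor):
    """Return `scorer` with `processor` bound as its default (hypothetical helper)."""
    return partial(scorer, processor=processor)


def toy_ratio(s1, s2, *, processor=None):
    """Toy scorer (not a RapidFuzz function) that mimics the `processor` keyword."""
    if processor is not None:
        s1, s2 = processor(s1), processor(s2)
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


# Bind a processor once; callers no longer need to pass it explicitly.
legacy_toy_ratio = with_processor(toy_ratio, str.lower)

print(toy_ratio("ABC", "abc"))                         # no preprocessing
print(legacy_toy_ratio("ABC", "abc"))                  # lowercased before comparison
print(legacy_toy_ratio("ABC", "abc", processor=None))  # the bound default can still be overridden
```

Applied to the real library, the same pattern would be `with_processor(rapidfuzz.fuzz.token_ratio, rapidfuzz.utils.default_process)`, which passes the same `processor` argument as the explicit call shown earlier in this thread.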