rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Significant performance drop in v2.3.1 as compared to v1.7.0

NikhilKothari opened this issue

Hi @maxbachmann!

There seems to be a 5x drop in the performance of the cdist function when using partial_ratio.

v1.7.0

[screenshot: cdist benchmark timings under v1.7.0]

v2.3.1

[screenshot: cdist benchmark timings under v2.3.1]
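
Roughly, the benchmark looked like this (a reconstructed sketch; the exact query string and number of timing runs are only visible in the screenshots and are assumptions here):

from timeit import timeit
from rapidfuzz import process, fuzz

queries = ["Daft Punk"]  # assumed query
choices = ["Daft Punk were a French electronic music duo formed in 1993 in Paris"] * 1000

# run under each version and compare the timings
print(timeit(lambda: process.cdist(queries, choices, scorer=fuzz.partial_ratio), number=1))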

Please let me know if there is something that I am missing which might be causing this issue.
Thanks in advance!

I am able to reproduce this regression and have found the commit that introduced it. I still need to find the exact reason, but it appears to have been introduced by rapidfuzz/rapidfuzz-cpp@v1.4.1...v1.5.0

A couple of reasons I have found for the performance regression so far:

change in sliding window algorithm

Before this change, I simply iterated over all possible substrings and compared them. In your case this can exit very early, since both Daft and Punk occur right at the start of the longer sequence.
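
A minimal sketch of that earlier approach, assuming len(query) <= len(longer) (the function name partial_ratio_naive is hypothetical; the real implementation lives in rapidfuzz-cpp and handles more edge cases):

from rapidfuzz import fuzz

def partial_ratio_naive(query, longer):
    best = 0.0
    # compare the query against every window of the same length, left to right
    for start in range(len(longer) - len(query) + 1):
        window = longer[start:start + len(query)]
        best = max(best, fuzz.ratio(query, window))
        if best == 100.0:  # early exit: a perfect match cannot be improved upon
            break
    return best

With this scan order, a query matching at the very start of the longer sequence exits almost immediately, while a query matching only near the end has to scan nearly the whole string.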

The new implementation uses a different pattern for this search, which allows the algorithm to skip the calculation for more substrings. This improves performance on average, but in your case it leads to the perfect match being found later.
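
As a loose illustration of how skipping calculations can pay off, the sketch below bounds each comparison with a score_cutoff; this is an assumption for illustration, not the actual search pattern used in rapidfuzz-cpp:

from rapidfuzz import fuzz

# Illustrative only: a running best score lets later windows be rejected
# cheaply, because fuzz.ratio returns 0 as soon as it can prove the score
# cannot reach score_cutoff. The real rapidfuzz-cpp search pattern differs.
def partial_ratio_skipping(query, longer):
    best = 0.0
    for start in range(len(longer) - len(query) + 1):
        window = longer[start:start + len(query)]
        score = fuzz.ratio(query, window, score_cutoff=best)
        best = max(best, score)
        if best == 100.0:  # a perfect match cannot be improved upon
            break
    return best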

E.g. for the following benchmarks:

from rapidfuzz import process, fuzz

choices = ["Daft Punk were a French electronic music duo formed in 1993 in Paris"] * 1000

# "Daft" matches at the very start, "Punk" right after it, "Pari" near the end
process.cdist(["Daft"], choices, scorer=fuzz.partial_ratio)
process.cdist(["Punk"], choices, scorer=fuzz.partial_ratio)
process.cdist(["Pari"], choices, scorer=fuzz.partial_ratio)

I get the following timings with v1.7.1:

0.09788650699920254
0.14970400799938943
0.6988558380035101

and the following in the current version:

0.09593261699774303
0.701189241000975
0.09907293100695824

correctness fix

Historically I implemented partial_ratio the same way as fuzzywuzzy for long sequences (>64 characters): it searches for the longest common substrings and uses those as starting points for further comparisons. This reduces the search space a lot, making it faster. However, the results are simply incorrect in a lot of cases.
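
For reference, a rough sketch of that fuzzywuzzy-style heuristic (simplified, with the hypothetical name partial_ratio_fw_style, assuming len(shorter) <= len(longer)):

from difflib import SequenceMatcher
from rapidfuzz import fuzz

def partial_ratio_fw_style(shorter, longer):
    best = 0.0
    for a, b, _size in SequenceMatcher(None, shorter, longer).get_matching_blocks():
        # only score windows aligned with a common substring; this shrinks
        # the search space a lot but can miss the optimal alignment entirely
        start = max(b - a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, fuzz.ratio(shorter, window))
    return best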

The new search pattern for the optimal substring is faster in the average case. It is unfortunate that this leads to a performance regression in your specific example, but I do not think there is anything that can be done to improve it while still keeping the performance improvement for the average case.