Significant performance drop in v2.3.1 as compared to v1.7.0
NikhilKothari opened this issue
Hi @maxbachmann!
There seems to be a 5x drop in the performance of the `cdist` function when using `partial_ratio` as the scorer.
v1.7.0: (benchmark attachment)
v2.3.1: (benchmark attachment)
Please let me know if there is something that I am missing which might be causing this issue.
Thanks in advance!
I am able to reproduce this regression and found the commit introducing it. I still need to find the exact reason, but it appears to have been introduced by rapidfuzz/rapidfuzz-cpp@v1.4.1...v1.5.0
A couple of reasons I have found for the performance regression so far:

1. Change in the sliding window algorithm
Before this change I simply iterated over all possible substrings and compared them. In your case this can exit very early, since both `Daft` and `Punk` appear at the start of the longer sequence.
The new implementation uses a different search pattern, which allows the algorithm to skip the calculation for more substrings. This improves performance on average, but in your case it means the perfect match is found later.
E.g. for the following benchmarks:

```python
from rapidfuzz import process, fuzz

choices = ["Daft Punk were a French electronic music duo formed in 1993 in Paris"] * 1000

process.cdist(["Daft"], choices, scorer=fuzz.partial_ratio)
process.cdist(["Punk"], choices, scorer=fuzz.partial_ratio)
process.cdist(["Pari"], choices, scorer=fuzz.partial_ratio)
```
I receive the following results in 1.7.1:

```
0.09788650699920254  # "Daft"
0.14970400799938943  # "Punk"
0.6988558380035101   # "Pari"
```
and the following in the current version:

```
0.09593261699774303  # "Daft"
0.701189241000975    # "Punk"
0.09907293100695824  # "Pari"
```
2. Correctness fix
Historically I implemented `partial_ratio` in the same way as fuzzywuzzy for long sequences (>64 characters). They search for the longest common substrings and use those as starting points for further comparisons. This reduces the search space a lot, making it faster. However, the results are simply incorrect in a lot of cases.
The new search pattern for the optimal substring is faster in the average case. It is unfortunate that this leads to a performance regression in your specific example, but I do not think there is anything that can be done to improve this while still keeping the performance improvement for the average case.