rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Significant performance drop in v2.3.1 as compared to v1.7.0

NikhilKothari opened this issue

Hi @maxbachmann!

There seems to be a 5x drop in the performance of the cdist function when using partial_ratio.

v1.7.0

[screenshot: cdist benchmark timings under v1.7.0]

v2.3.1

[screenshot: cdist benchmark timings under v2.3.1]
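
Roughly, the benchmark looked like this (a reconstructed sketch; the exact query string and number of timing runs are only visible in the screenshots and are assumptions here):

from timeit import timeit
from rapidfuzz import process, fuzz

queries = ["Daft Punk"]  # assumed query
choices = ["Daft Punk were a French electronic music duo formed in 1993 in Paris"] * 1000

# run under each version and compare the timings
print(timeit(lambda: process.cdist(queries, choices, scorer=fuzz.partial_ratio), number=1))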

Please let me know if there is something that I am missing which might be causing this issue.
Thanks in advance!

I am able to reproduce this regression and have found the commit that introduced it. I still need to find the exact reason, but it appears to have been introduced by rapidfuzz/rapidfuzz-cpp@v1.4.1...v1.5.0

A couple of reasons I have found for the performance regression so far:

change in sliding window algorithm

Before this change, I simply iterated over all possible substrings and compared them. In your case this can exit very early, since both Daft and Punk occur right at the start of the longer sequence.
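
A minimal sketch of that earlier approach, assuming len(query) <= len(longer) (the function name partial_ratio_naive is hypothetical; the real implementation lives in rapidfuzz-cpp and handles more edge cases):

from rapidfuzz import fuzz

def partial_ratio_naive(query, longer):
    best = 0.0
    # compare the query against every window of the same length, left to right
    for start in range(len(longer) - len(query) + 1):
        window = longer[start:start + len(query)]
        best = max(best, fuzz.ratio(query, window))
        if best == 100.0:  # early exit: a perfect match cannot be improved upon
            break
    return best

With this scan order, a query matching at the very start of the longer sequence exits almost immediately, while a query matching only near the end has to scan nearly the whole string.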

The new implementation uses a different pattern for this search, which allows the algorithm to skip the calculation for more substrings. This improves performance on average, but in your case it leads to the perfect match being found later.
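
As a loose illustration of how skipping calculations can pay off, the sketch below bounds each comparison with a score_cutoff; this is an assumption for illustration, not the actual search pattern used in rapidfuzz-cpp:

from rapidfuzz import fuzz

# Illustrative only: a running best score lets later windows be rejected
# cheaply, because fuzz.ratio returns 0 as soon as it can prove the score
# cannot reach score_cutoff. The real rapidfuzz-cpp search pattern differs.
def partial_ratio_skipping(query, longer):
    best = 0.0
    for start in range(len(longer) - len(query) + 1):
        window = longer[start:start + len(query)]
        score = fuzz.ratio(query, window, score_cutoff=best)
        best = max(best, score)
        if best == 100.0:  # a perfect match cannot be improved upon
            break
    return best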

E.g. for the following benchmarks:

from rapidfuzz import process, fuzz

choices = ["Daft Punk were a French electronic music duo formed in 1993 in Paris"] * 1000

# "Daft" matches at the very start, "Punk" right after it, "Pari" near the end
process.cdist(["Daft"], choices, scorer=fuzz.partial_ratio)
process.cdist(["Punk"], choices, scorer=fuzz.partial_ratio)
process.cdist(["Pari"], choices, scorer=fuzz.partial_ratio)

I get the following timings with v1.7.1:

0.09788650699920254
0.14970400799938943
0.6988558380035101

and the following in the current version:

0.09593261699774303
0.701189241000975
0.09907293100695824

correctness fix

Historically I implemented partial_ratio the same way as fuzzywuzzy for long sequences (>64 characters): it searches for the longest common substrings and uses those as starting points for further comparisons. This reduces the search space a lot, making it faster. However, the results are simply incorrect in a lot of cases.
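
For reference, a rough sketch of that fuzzywuzzy-style heuristic (simplified, with the hypothetical name partial_ratio_fw_style, assuming len(shorter) <= len(longer)):

from difflib import SequenceMatcher
from rapidfuzz import fuzz

def partial_ratio_fw_style(shorter, longer):
    best = 0.0
    for a, b, _size in SequenceMatcher(None, shorter, longer).get_matching_blocks():
        # only score windows aligned with a common substring; this shrinks
        # the search space a lot but can miss the optimal alignment entirely
        start = max(b - a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, fuzz.ratio(shorter, window))
    return best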

The new search pattern for the optimal substring is faster in the average case. It is unfortunate that this leads to a performance regression in your specific example, but I do not think there is anything that can be done to improve it while still keeping the performance improvement for the average case.