rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Tips for speeding up process.cdist for matching very large lists?

panosk opened this issue

Hello,

Thanks for your awesome work!

I need to match very large lists of strings (hundreds of thousands, even millions) against other large lists (200k-500k strings), so I'm using process.cdist, but I can't get the desired performance. Are there any tips or suggestions for speeding up the process? String length can be arbitrary, but usually > 50 chars. This is how I call cdist:

from rapidfuzz import process, fuzz

scores = process.cdist(query_list,
                       matching_list,
                       scorer=fuzz.ratio,
                       score_cutoff=70,
                       workers=-1)

Thanks!

I need to match very large lists of strings (hundreds of thousands, even millions) against other large lists (200k-500k strings), so I'm using process.cdist, but I can't get the desired performance

Not sure what your performance target is, but this could mean more than 100 billion text comparisons (e.g. 500k queries x 200k choices is already 100 billion pairs).

String length can be arbitrary, but usually > 50 chars

On recent x64 CPUs with AVX2, this should still compare around 4 strings in parallel using SIMD for all texts with <= 64 characters. I do not have a SIMD implementation for longer texts yet, but it would probably be worth adding one, especially for AVX2, at some point.
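
As a rough way to gauge how much of your data falls on that fast path, you could simply count the strings at or under the 64-character limit (a quick diagnostic in plain Python, not a rapidfuzz API; query_list and matching_list are the lists from the question):

# Count how many strings fit the <= 64 character SIMD fast path
# described above.
simd_eligible = sum(len(s) <= 64 for s in query_list + matching_list)
total = len(query_list) + len(matching_list)
print(f"{simd_eligible}/{total} strings can use the SIMD path")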

score_cutoff=70

The score_cutoff will sadly not really help in your case. It is simply checked against the result after the string matching. With regard to the performance of fuzz.ratio, it only helps in two cases:

  1. when it means only a very small number of edits (<= 4) is allowed. In this case it is worth it to simply brute force all possible combinations of up to 4 edits.
  2. when the texts are very long, since in this case it allows the use of Ukkonen's algorithm to calculate only a smaller band inside the Levenshtein matrix. This only saves much for strings with lengths > 1k characters.

In your case, given a length of 50, score_cutoff=70 still allows up to 30 edits (15 substitutions); the arithmetic below spells this out.
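
A quick worked example of that arithmetic (assuming fuzz.ratio's usual normalization over the combined length of both strings):

# fuzz.ratio is a normalized indel similarity, roughly:
#   ratio = 100 * (1 - indel_distance / (len1 + len2))
len1 = len2 = 50
score_cutoff = 70
max_edits = (1 - score_cutoff / 100) * (len1 + len2)
print(max_edits)  # 30.0 -> up to 30 indel edits, i.e. ~15 substitutions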

Improving performance

This should already be pretty much the optimal way to call cdist (just make sure the result matrix fits into RAM). There are a couple of ways to improve the performance in theory, but none of them are available in rapidfuzz right now.
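
If the full result matrix does not fit into RAM, one workaround is to slice the query list and process one block at a time. This is only a sketch (cdist_chunked and the chunk_size value are made up for illustration, not part of rapidfuzz):

from rapidfuzz import process, fuzz

def cdist_chunked(queries, choices, chunk_size=10_000):
    # Compute scores for chunk_size queries at a time, so only a
    # chunk_size x len(choices) block is held in memory at once.
    for start in range(0, len(queries), chunk_size):
        scores = process.cdist(queries[start:start + chunk_size],
                               choices,
                               scorer=fuzz.ratio,
                               score_cutoff=70,
                               workers=-1)
        yield start, scores  # caller filters/stores each block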

  1. Many metrics satisfy the triangle inequality, i.e. dist(A, C) <= dist(A, B) + dist(B, C). Some users are only interested in a boolean result (>= score_cutoff) and not a precise score. In this case it should be possible to make use of the triangle inequality to save some comparisons (see the sketch after this list).
  2. BK-trees should be helpful, especially for very close matches. However, I do not have an implementation of them so far.
  3. depending on the data, you might be aware of some other algorithm which allows for faster filtering ahead of time.
  4. obviously you can always throw more hardware at the problem 😉
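
To illustrate point 1: with a fixed pivot string B, the triangle inequality gives dist(A, C) >= |dist(A, B) - dist(B, C)|, so any pair whose lower bound already exceeds the allowed distance can be skipped without an exact comparison. A minimal sketch, using Levenshtein distance for illustration and an arbitrary pivot (matches_within is a hypothetical helper, not a rapidfuzz API):

from rapidfuzz.distance import Levenshtein

def matches_within(queries, choices, pivot, max_dist):
    # Precompute every choice's distance to a single pivot string.
    choice_dists = [Levenshtein.distance(pivot, c) for c in choices]
    for q in queries:
        dq = Levenshtein.distance(q, pivot)
        for c, dc in zip(choices, choice_dists):
            # Lower bound from the triangle inequality:
            # dist(q, c) >= |dist(q, pivot) - dist(pivot, c)|
            if abs(dq - dc) > max_dist:
                continue  # pruned without an exact comparison
            if Levenshtein.distance(q, c, score_cutoff=max_dist) <= max_dist:
                yield q, c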

I think especially adding support for BK-trees to rapidfuzz would make a lot of sense.
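
For reference, a minimal BK-tree sketch over Levenshtein distance (hypothetical code, not something rapidfuzz ships). The tree stores children keyed by their distance to the parent; during a lookup, the triangle inequality means only edges in [d - max_dist, d + max_dist] can lead to matches, so everything else is pruned:

from rapidfuzz.distance import Levenshtein

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # (word, {distance: child})
        for word in it:
            self._add(word)

    def _add(self, word):
        node, children = self.root
        while True:
            d = Levenshtein.distance(word, node)
            if d == 0:
                return  # duplicate word, nothing to insert
            if d not in children:
                children[d] = (word, {})
                return
            node, children = children[d]

    def search(self, query, max_dist):
        # Return all stored words within max_dist of query.
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = Levenshtein.distance(query, node)
            if d <= max_dist:
                results.append((node, d))
            for edge, child in children.items():
                # Subtrees outside [d - max_dist, d + max_dist]
                # cannot contain a match (triangle inequality).
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results

Building the tree once over matching_list and then calling search per query would avoid a full scan of the choices for near matches.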

Thank you for your detailed feedback.

I realize this kind of matching will nevertheless take a considerable amount of time.
It seems filtering the strings to a maximum length is the only practical way to speed up the process.
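
One such pre-filter, sketched below (the bound assumes fuzz.ratio's normalization over the combined length): the length difference alone is a lower bound on the indel distance, so incompatible pairs can be dropped before any matching. rapidfuzz likely applies a similar bound internally per pair, so the practical gain would come from bucketing the choices by length once up front:

def length_compatible(query, choice, score_cutoff=70):
    # |len1 - len2| is a lower bound on the indel distance. If it
    # already exceeds the edit budget implied by score_cutoff, the
    # pair can never reach the cutoff and can be skipped.
    max_edits = (1 - score_cutoff / 100) * (len(query) + len(choice))
    return abs(len(query) - len(choice)) <= max_edits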