rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Benchmark

mirix opened this issue

This is not an issue; I just also wanted to report something positive: when comparing execution time on 10 million string pairs (2-22 characters long) stored in a Pandas dataframe, RapidFuzz's Damerau-Levenshtein beats both pyxDamerauLevenshtein and jellyfish.

Interestingly, for a single string pair pyxDamerauLevenshtein is considerably faster, but after only a few string pairs RapidFuzz catches up and overtakes it.
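For reference, the test was essentially of this shape (a minimal sketch with toy data and made-up column names, not the actual code):

import random
import string

import pandas as pd
from rapidfuzz.distance import DamerauLevenshtein

# toy stand-in for the real dataframe: random 2-22 character strings in two columns
rng = random.Random(0)
def rand_word():
    return "".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 22)))

df = pd.DataFrame({"a": [rand_word() for _ in range(100_000)],
                   "b": [rand_word() for _ in range(100_000)]})

# row-wise Damerau-Levenshtein distance with rapidfuzz
df["dist"] = [DamerauLevenshtein.distance(a, b) for a, b in zip(df["a"], df["b"])]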

Glad to hear that the library proves to be useful :)

Do I understand correctly that you essentially have two columns A and B, and in each row you compare A with B?

Not exactly. On the one hand, I have lists of named entities extracted from different texts. On the other, multiple columns from a PostgreSQL table, each containing a dictionary with multiple combinations of keywords. The algorithm needs to find the best match and provide a similarity score.

If we are talking about written text, I use the Damerau-Levenshtein similarity score on the glyphs only.

If, on the other hand, the texts are transcripts, I use the Levenshtein similarity score both on the glyphs and on their metaphone3 encodings.

In summary, it is a many vs many comparison run in parallel.

This is the real scenario. The Pandas dataframe was just for testing, but it was also many vs many.
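Roughly, the best-match lookup looks like this (a hedged sketch; the entities and keywords below are invented, and the real code runs many such lookups in parallel):

from rapidfuzz import process
from rapidfuzz.distance import DamerauLevenshtein

# invented example data: extracted entities vs. keyword variants from the database
entities = ["Jon Smith", "ACME Corp"]
keywords = ["John Smith", "Jane Smith", "ACME Corporation"]

for entity in entities:
    # best match and its normalized similarity (1.0 = identical)
    match, score, _ = process.extractOne(
        entity, keywords, scorer=DamerauLevenshtein.normalized_similarity)
    print(f"{entity} -> {match} (similarity {score:.2f})")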

Not in depth. However, I just had a quick look at their two benchmarks:

Long sequences

This benchmark is mentioned in the readme, which claims they are significantly faster than the others:

Several Python libraries provide edit distance computation. Most of them are implemented in C, but are not always as fast as StringZilla. Taking 1'000 proteins around 10'000 characters long and computing just 100 distances:

  • JellyFish: 62.3s
  • EditDistance: 32.9s
  • StringZilla: 0.8s

Usually I would assume that they simply didn't know about better implementations. However, their benchmark script (https://github.com/ashvardanian/StringZilla/blob/main/scripts/bench_similarity.ipynb) actually includes rapidfuzz as well, which is significantly faster on my machine when I run their benchmark:

  • StringZilla: 8.8 sec
  • jellyfish: 38.5 sec
  • editdistance: 12.4 sec
  • rapidfuzz: 0.35 sec

In addition, StringZilla is a lot slower than their reported number, while all the other libraries are actually faster on my machine. Since they do not mention the CPU they tested on (maybe one with AVX-512?), I question their results. I also find it questionable that they only publish results for the libraries they found to be much slower 🤷‍♂️

Short Strings

For their short-string test they do perform better than rapidfuzz. However, the benchmark spends most of its time just selecting the random strings, which makes it unusable as a benchmark. So as a first step I factored this out:

import random

# "words" is the word list loaded earlier in the benchmark notebook
queries = [random.choice(words) for _ in range(1000000)]
choices = [random.choice(words) for _ in range(1000000)]

def checksum_distances(tokens, distance_function, n: int = 1000000):
    # tokens and n are unused; they are kept to match the original benchmark's signature
    distances_sum = 0
    for a, b in zip(queries, choices):
        distances_sum += distance_function(a, b)
    return distances_sum
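
The same function can then be fed each library's distance function, so everything is measured on identical pairs (a sketch, assuming editdistance, jellyfish, and rapidfuzz are installed):

import editdistance
import jellyfish
from rapidfuzz.distance import Levenshtein

# each call walks over the exact same precomputed queries/choices pairs
checksum_distances(None, Levenshtein.distance)            # rapidfuzz
checksum_distances(None, editdistance.eval)               # editdistance
checksum_distances(None, jellyfish.levenshtein_distance)  # jellyfish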

This has the additional advantage that everyone now compares the same strings. This gives the following results:

  • StringZilla: 310 ms
  • jellyfish: 1.8 sec
  • editdistance: 732 ms
  • rapidfuzz: 388 ms

When comparing only empty strings I get:

  • StringZilla: 230 ms
  • jellyfish: 220 ms
  • editdistance: 192 ms
  • rapidfuzz: 276 ms

Maybe there are things in rapidfuzz's wrapping code that could be improved a bit to reach the same numbers 🤔
Overall, people should try to use the process module as much as possible when working with tons of short strings, e.g. when comparing each element of two lists with each other:

# reuses random/words from the snippet above; 10'000 x 10'000 = 100 million comparisons
queries = [random.choice(words) for _ in range(10000)]
choices = [random.choice(words) for _ in range(10000)]

def checksum_distances(tokens, distance_function, n: int = 1000000):
    # naive element-by-element comparison of the two lists
    for a in queries:
        for b in choices:
            distance_function(a, b)

rapidfuzz can do a lot better by using process.cdist(queries, choices):

  • StringZilla: 25.3 sec
  • RapidFuzz: 23.8 sec
  • RapidFuzz (cdist): 1.28 sec
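
For reference, the cdist variant is essentially a one-liner (a minimal sketch; the scorer and worker count are only illustrative):

from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

# full 10'000 x 10'000 distance matrix, computed in C++ and parallelized across all cores
matrix = process.cdist(queries, choices, scorer=Levenshtein.distance, workers=-1)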