rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Benchmark

mirix opened this issue

This is not an issue; I just also wanted to report something positive: when comparing execution time on 10 million string pairs (2-22 characters long) stored in a Pandas dataframe, RapidFuzz's Damerau-Levenshtein beats both pyxDamerauLevenshtein and jellyfish.

Interestingly, for a single string pair pyxDamerauLevenshtein is considerably faster, but after only a few string pairs RapidFuzz catches up and overtakes it.
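For reference, the test was essentially of this shape (a minimal sketch with toy data and made-up column names, not the actual code):

import random
import string

import pandas as pd
from rapidfuzz.distance import DamerauLevenshtein

# toy stand-in for the real dataframe: random 2-22 character strings in two columns
rng = random.Random(0)
def rand_word():
    return "".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 22)))

df = pd.DataFrame({"a": [rand_word() for _ in range(100_000)],
                   "b": [rand_word() for _ in range(100_000)]})

# row-wise Damerau-Levenshtein distance with rapidfuzz
df["dist"] = [DamerauLevenshtein.distance(a, b) for a, b in zip(df["a"], df["b"])]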

Glad to hear that the library proves to be useful :)

Do I understand correctly that you essentially have two columns A and B, and in each row you compare A with B?

Not exactly. On the one hand, I have lists of named entities extracted from different texts. On the other, multiple columns from a PostgreSQL table, each containing a dictionary with multiple combinations of keywords. The algorithm needs to find the best match and provide a similarity score.

If we are talking about written text, I use the Damerau-Levenshtein similarity score on the glyphs only.

If, on the other hand, the texts are transcripts, I use the Levenshtein similarity score both on the glyphs and on their metaphone3 encodings.

In summary, it is a many vs many comparison run in parallel.

This is the real scenario. The Pandas dataframe was just for testing, but it was also many vs many.
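Roughly, the best-match lookup looks like this (a hedged sketch; the entities and keywords below are invented, and the real code runs many such lookups in parallel):

from rapidfuzz import process
from rapidfuzz.distance import DamerauLevenshtein

# invented example data: extracted entities vs. keyword variants from the database
entities = ["Jon Smith", "ACME Corp"]
keywords = ["John Smith", "Jane Smith", "ACME Corporation"]

for entity in entities:
    # best match and its normalized similarity (1.0 = identical)
    match, score, _ = process.extractOne(
        entity, keywords, scorer=DamerauLevenshtein.normalized_similarity)
    print(f"{entity} -> {match} (similarity {score:.2f})")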

Not in depth. However, I just had a quick look at their two benchmarks:

Long sequences

This benchmark is mentioned in the readme, which claims they are significantly faster than the others:

Several Python libraries provide edit distance computation. Most of them are implemented in C, but are not always as fast as StringZilla. Taking 1'000 proteins around 10'000 characters long and computing just 100 distances:

  • JellyFish: 62.3s
  • EditDistance: 32.9s
  • StringZilla: 0.8s

Usually I would assume that they simply didn't know about better implementations. However, their benchmark script (https://github.com/ashvardanian/StringZilla/blob/main/scripts/bench_similarity.ipynb) actually includes rapidfuzz as well, which is significantly faster on my machine when I run their benchmark:

  • StringZilla: 8.8 sec
  • jellyfish: 38.5 sec
  • editdistance: 12.4 sec
  • rapidfuzz: 0.35 sec

In addition, StringZilla is a lot slower than their reported number, while all the other libraries are actually faster on my machine. Since they do not mention the CPU they tested on (maybe one with AVX-512?), I question their results. I also find it questionable that they only publish results for the libraries they found to be much slower 🤷‍♂️

Short Strings

For their short-string test they do perform better than rapidfuzz. However, the benchmark spends most of its time just selecting the random strings, which makes it unusable as a benchmark. So as a first step I factored this out:

import random

# "words" is the word list loaded earlier in the benchmark notebook
queries = [random.choice(words) for _ in range(1000000)]
choices = [random.choice(words) for _ in range(1000000)]

def checksum_distances(tokens, distance_function, n: int = 1000000):
    # tokens and n are unused; they are kept to match the original benchmark's signature
    distances_sum = 0
    for a, b in zip(queries, choices):
        distances_sum += distance_function(a, b)
    return distances_sum
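
The same function can then be fed each library's distance function, so everything is measured on identical pairs (a sketch, assuming editdistance, jellyfish, and rapidfuzz are installed):

import editdistance
import jellyfish
from rapidfuzz.distance import Levenshtein

# each call walks over the exact same precomputed queries/choices pairs
checksum_distances(None, Levenshtein.distance)            # rapidfuzz
checksum_distances(None, editdistance.eval)               # editdistance
checksum_distances(None, jellyfish.levenshtein_distance)  # jellyfish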

This has the additional advantage that everyone now compares the same strings. This gives the following results:

  • StringZilla: 310 ms
  • jellyfish: 1.8 sec
  • editdistance: 732 ms
  • rapidfuzz: 388 ms

When comparing only empty strings I get:

  • StringZilla: 230 ms
  • jellyfish: 220 ms
  • editdistance: 192 ms
  • rapidfuzz: 276 ms

Maybe there are things in rapidfuzz's wrapping code that could be improved a bit to reach the same numbers 🤔
Overall, people should try to use the process module as much as possible when working with tons of short strings, e.g. when comparing each element of two lists with each other:

# reuses random/words from the snippet above; 10'000 x 10'000 = 100 million comparisons
queries = [random.choice(words) for _ in range(10000)]
choices = [random.choice(words) for _ in range(10000)]

def checksum_distances(tokens, distance_function, n: int = 1000000):
    # naive element-by-element comparison of the two lists
    for a in queries:
        for b in choices:
            distance_function(a, b)

rapidfuzz can do a lot better by using process.cdist(queries, choices):

  • StringZilla: 25.3 sec
  • RapidFuzz: 23.8 sec
  • RapidFuzz (cdist): 1.28 sec
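
For reference, the cdist variant is essentially a one-liner (a minimal sketch; the scorer and worker count are only illustrative):

from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

# full 10'000 x 10'000 distance matrix, computed in C++ and parallelized across all cores
matrix = process.cdist(queries, choices, scorer=Levenshtein.distance, workers=-1)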