rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

cdist Levenshtein score_cutoff problem in v2.3.0 or older

peterjc opened this issue

Test case, extracted from the tests of one of my own tools, which were failing when I tried to use cdist for efficiency:

from rapidfuzz import __version__
from rapidfuzz.distance import Levenshtein
from rapidfuzz.process import cdist

print(f"Using RapidFuzz {__version__}")

a_list = [
    "GGTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGTGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
]
b_list = [
    "TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGTGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
    "TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
    "TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
    "TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
]

for threshold in (None, 2, 3, 4, 5):
    dists = cdist(
        a_list,
        b_list,
        scorer=Levenshtein.distance,
        score_cutoff=threshold,
    )
    if threshold is not None:
        expt = [
            [min(Levenshtein.distance(a, b), threshold + 1) for b in b_list]
            for a in a_list
        ]
    else:
        expt = [[Levenshtein.distance(a, b) for b in b_list] for a in a_list]
    print(f"Using score_cutoff={threshold}: {dists}, expected {expt}")

Good output:

Using RapidFuzz 2.6.1
Using score_cutoff=None: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=2: [[2 3 3 3]], expected [[2, 3, 3, 3]]
Using score_cutoff=3: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=4: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=5: [[2 4 3 4]], expected [[2, 4, 3, 4]]

Bad output:

Using RapidFuzz 2.2.0
Using score_cutoff=None: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=2: [[2 3 3 3]], expected [[2, 3, 3, 3]]
Using score_cutoff=3: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=4: [[2 5 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=5: [[2 6 3 4]], expected [[2, 4, 3, 4]]

In the last two lines, the second distance is wrongly reported as score_cutoff + 1 rather than 4.
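
For reference, the expected values in the test case follow the rule "report the true distance, capped at score_cutoff + 1". The sketch below spells that rule out with a plain dynamic-programming Levenshtein distance. It is only an illustration of the expected semantics, not RapidFuzz's implementation, and the helper names `levenshtein` and `capped_distance` are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance with unit costs."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                   # deletion
                    current[j - 1] + 1,                # insertion
                    previous[j - 1] + (ch_a != ch_b),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def capped_distance(a: str, b: str, score_cutoff=None) -> int:
    """Hypothetical helper: true distance, capped at score_cutoff + 1."""
    dist = levenshtein(a, b)
    if score_cutoff is None:
        return dist
    return min(dist, score_cutoff + 1)


print(capped_distance("kitten", "sitting"))                  # 3
print(capped_distance("kitten", "sitting", score_cutoff=1))  # 2
print(capped_distance("kitten", "sitting", score_cutoff=5))  # 3
```

Under this rule a true distance of 4 should be reported as 4 for any score_cutoff of 4 or more, which is why the 5 and 6 in the bad output are wrong.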

This seems to have been fixed between version 2.3.0 (broken) and 2.4.1 (working).

I'm guessing this is what "fix banded Levenshtein implementation" meant in the change log, but I couldn't spot an issue logged about that? If so, I can just set a minimum version to avoid the issue.

Or might I have stumbled on something slightly different?

> I'm guessing this is what "fix banded Levenshtein implementation" meant in the change log, but I couldn't spot an issue logged about that?

Yes, this should be the issue. I disabled the broken implementation in rapidfuzz/rapidfuzz-cpp@ea90dd4 with some tests and added a fixed implementation in rapidfuzz/rapidfuzz-cpp@1406b08.
I probably could have been clearer in the release notes that the bug could lead to incorrect results when score_cutoff is in range(4, 32).

As you have discovered, this can be avoided by setting the minimum version to >=2.4.0.
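
For anyone who cannot upgrade straight away, one possible workaround (an assumption on my part, not something proposed in this thread) is to call cdist without score_cutoff, which gave correct results on 2.2.0 for the test case above, and to apply the cap afterwards. The sketch below works on plain nested lists. The names `AFFECTED_CUTOFFS`, `cutoff_is_affected` and `clamp_distances` are hypothetical, and the affected range is taken from the comment above.

```python
# Cutoffs reported above as potentially giving wrong results before 2.4.0.
AFFECTED_CUTOFFS = range(4, 32)


def cutoff_is_affected(score_cutoff) -> bool:
    """Return True if this score_cutoff falls in the reported problem range."""
    return score_cutoff is not None and score_cutoff in AFFECTED_CUTOFFS


def clamp_distances(matrix, score_cutoff):
    """Cap every distance in a nested list at score_cutoff + 1."""
    cap = score_cutoff + 1
    return [[min(dist, cap) for dist in row] for row in matrix]


# Uncapped distances from the test case above, capped after the fact.
full = [[2, 4, 3, 4]]
print(clamp_distances(full, 2))  # [[2, 3, 3, 3]]
print(clamp_distances(full, 5))  # [[2, 4, 3, 4]]
```

This gives up the speed benefit of passing score_cutoff to the scorer, so it only makes sense as a stopgap. The array returned by cdist would first need converting, for example with `.tolist()`.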

Perfect - that explains this behaviour, and setting the minimum version is no problem. Thank you!
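
Besides declaring `rapidfuzz>=2.4.0` as a dependency, a tool could also check the installed version at start-up. The following is a minimal sketch using only the standard library. `MIN_FIXED`, `parse_version` and `has_banded_fix` are hypothetical names, and the parser deliberately looks only at the leading numeric part of the version string.

```python
import re

# Minimum version named in this thread as containing the fix.
MIN_FIXED = (2, 4, 0)


def parse_version(text: str) -> tuple:
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"Cannot parse version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def has_banded_fix(version_text: str) -> bool:
    """Return True if the version is at least MIN_FIXED."""
    parts = parse_version(version_text)
    # Pad short versions so that "2.4" compares equal to (2, 4, 0).
    parts = parts + (0,) * (3 - len(parts))
    return parts >= MIN_FIXED


print(has_banded_fix("2.2.0"))  # False
print(has_banded_fix("2.4.0"))  # True
print(has_banded_fix("2.6.1"))  # True
```

It could be called as `has_banded_fix(__version__)` with `__version__` imported from rapidfuzz as in the test case, raising an error or falling back to an uncapped call when it returns False.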