cdist Levenshtein score_cutoff problem in v2.3.0 or older
peterjc opened this issue · comments
Test case, extracted from one of my own tool's tests which was failing when trying to use cdist
for efficiency:
from rapidfuzz import __version__
from rapidfuzz.distance import Levenshtein
from rapidfuzz.process import cdist
print(f"Using RapidFuzz {__version__}")
a_list = [
"GGTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGTGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
]
b_list = [
"TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGTGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
"TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
"TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
"TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAACTTTCCACGTGAACCGTATCAACCCATTTAGTTGGGGCTTGCTCGGGTGGCTGGCTGTCGATGTCAAAGTTGACGGCTGCTGCTGTGTGGCGGGCCCTATCATGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCCAGTAGCCCTCTCTTTTAAACCCATTCTTGAATACTGAATATACT",
]
for threshold in (None, 2, 3, 4, 5):
    dists = cdist(
        a_list,
        b_list,
        scorer=Levenshtein.distance,
        score_cutoff=threshold,
    )
    if threshold:
        expt = [
            [min(Levenshtein.distance(a, b), threshold + 1) for b in b_list]
            for a in a_list
        ]
    else:
        expt = [[Levenshtein.distance(a, b) for b in b_list] for a in a_list]
    print(f"Using score_cutoff={threshold}: {dists}, expected {expt}")
Good output:
Using RapidFuzz 2.6.1
Using score_cutoff=None: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=2: [[2 3 3 3]], expected [[2, 3, 3, 3]]
Using score_cutoff=3: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=4: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=5: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Bad output:
Using RapidFuzz 2.2.0
Using score_cutoff=None: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=2: [[2 3 3 3]], expected [[2, 3, 3, 3]]
Using score_cutoff=3: [[2 4 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=4: [[2 5 3 4]], expected [[2, 4, 3, 4]]
Using score_cutoff=5: [[2 6 3 4]], expected [[2, 4, 3, 4]]
In the last two lines, the second distance is wrongly reported as score_cutoff+1 rather than 4.
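The expected values in the test case rely on the convention that any distance above score_cutoff is reported as score_cutoff + 1. As a cross-check that does not depend on RapidFuzz, here is a minimal pure-Python sketch of that convention. The names levenshtein and clamped_distance are hypothetical helpers written for illustration, not part of the RapidFuzz API. The plain dynamic-programming approach is much slower than the library and is only meant for verifying small cases.

```python
def levenshtein(a, b):
    """Unrestricted Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def clamped_distance(a, b, score_cutoff=None):
    """Apply the cutoff convention used by the test case in this thread:
    distances above score_cutoff are reported as score_cutoff + 1."""
    distance = levenshtein(a, b)
    if score_cutoff is not None and distance > score_cutoff:
        return score_cutoff + 1
    return distance
```

With this reference, a true distance of 4 must stay 4 for any score_cutoff of 4 or more, which is exactly what the bad output violates.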
This seems to have been fixed between version 2.3.0 (broken) and 2.4.1 (working).
I'm guessing this is what "fix banded Levenshtein implementation" meant in the change log, but I couldn't spot an issue logged about that? If so, I can just set a minimum version to avoid the issue.
Or might I have stumbled on something slightly different?
> I'm guessing this is what "fix banded Levenshtein implementation" meant in the change log, but I couldn't spot an issue logged about that?
Yes, this should be the issue. I disabled the broken implementation in rapidfuzz/rapidfuzz-cpp@ea90dd4, with some tests, and added a fixed implementation in rapidfuzz/rapidfuzz-cpp@1406b08.
I probably could have been clearer in the release notes that the bug could lead to incorrect results when score_cutoff is in range(4, 32).
As you have discovered, this should be fixed by setting the minimum version to >=2.4.0.
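For tools that cannot pin the dependency right away, a runtime guard is one possible stopgap. The sketch below uses only the two facts stated in this thread: the fix is available from 2.4.0, and the affected cutoffs are range(4, 32). The helper names parse_version and may_hit_banded_bug are hypothetical. Treating every release below 2.4.0 as possibly affected is a conservative assumption, since the thread does not say when the bug was introduced.

```python
import re

MIN_SAFE_VERSION = (2, 4, 0)  # minimum version suggested in this thread
AFFECTED_CUTOFFS = range(4, 32)  # cutoff range named by the maintainer


def parse_version(version):
    """Return the leading major.minor[.patch] of a version string as ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    # A missing patch component (e.g. "2.4") is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def may_hit_banded_bug(version, score_cutoff):
    """True if this version/cutoff pair could give wrong Levenshtein results.

    Assumption: all versions below MIN_SAFE_VERSION are treated as affected.
    """
    if score_cutoff is None:
        return False
    return (
        parse_version(version) < MIN_SAFE_VERSION
        and score_cutoff in AFFECTED_CUTOFFS
    )
```

A caller could pass rapidfuzz.__version__ as the first argument and either raise an error or fall back to calling cdist without score_cutoff when the function returns True. In packaging metadata, the same constraint is the requirement specifier rapidfuzz>=2.4.0.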
Perfect - that explains this behaviour, and setting the minimum version is no problem. Thank you!