rapidfuzz not handling custom classes of sequence elements anymore
mikegerber opened this issue
When running this (using a custom class of hashable sequence elements):
from rapidfuzz.distance import Levenshtein


class FuzzyString:
    """
    Example string class that compares equal to another FuzzyString if the
    two are almost equal.
    """

    def __init__(self, string):
        self._string = string

    def __eq__(self, other):
        # Debug output: shows what rapidfuzz passes in as `other`
        print(self, other)
        # Just an example!
        min_len = min(len(self._string), len(other._string))
        if min_len > 0:
            # Character-level distance, normalized by the shorter string
            normalized_distance = (
                Levenshtein.distance(self._string, other._string) / min_len
            )
            similar = normalized_distance < 0.1
        else:
            similar = False
        return similar

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "FuzzyString('%s')" % self._string

    def __hash__(self):
        return hash(self._string)


# Here we try to compare sequences of lines, where lines are matched if they
# are equal or almost equal.
s1 = [
    FuzzyString("This is a line."),
    FuzzyString("This is another"),
    FuzzyString("And the last line"),
]
s2 = [
    FuzzyString("This is a ljne."),
    FuzzyString("This is another"),
    FuzzyString("J u n k"),
    FuzzyString("And the last line"),
]
#print(Levenshtein.editops(s1, s2))
print(Levenshtein.distance(s1, s2))
I get the following output and error:
FuzzyString('This is a line.') -1
Traceback (most recent call last):
  File "/home/mike/rapidfuzz-bug-custom-class.py", line 51, in <module>
    print(Levenshtein.distance(s1, s2))
  File "Levenshtein_cpp.pyx", line 115, in rapidfuzz.distance.Levenshtein_cpp.distance
  File "cpp_common.pxd", line 377, in cpp_common.preprocess_strings
  File "cpp_common.pxd", line 331, in cpp_common.conv_sequence
  File "cpp_common.pxd", line 320, in cpp_common.hash_sequence
  File "cpp_common.pxd", line 316, in cpp_common.hash_sequence
  File "cpp_common.pxd", line 245, in cpp_common.rf_hash
  File "/home/mike/rapidfuzz-bug-custom-class.py", line 16, in __eq__
    min_len = min(len(self._string), len(other._string))
rapidfuzz 2.5.0 seems to be comparing (or trying to compare) an object with -1 instead of with another object.
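For anyone stuck on an affected version, a user-side workaround might be to make `__eq__` tolerate foreign operands. The following is only a sketch and not the fix that went into rapidfuzz: it returns `NotImplemented` for anything that is not a `FuzzyString`, so Python falls back to its default comparison instead of touching `other._string`. The similarity check uses `difflib.SequenceMatcher` with a threshold of 0.9 purely as a stand-in, so the snippet runs without rapidfuzz installed; both are assumptions.

```python
from difflib import SequenceMatcher


class FuzzyString:
    """Variant of the class above with a defensive __eq__ (illustrative only)."""

    def __init__(self, string):
        self._string = string

    def __eq__(self, other):
        # Foreign operands (such as the int -1 seen in the output above) are
        # handed back to Python, which then falls back to identity comparison.
        if not isinstance(other, FuzzyString):
            return NotImplemented
        # Stand-in similarity test; the 0.9 threshold is an assumption.
        ratio = SequenceMatcher(None, self._string, other._string).ratio()
        return ratio > 0.9

    def __hash__(self):
        return hash(self._string)

    def __repr__(self):
        return "FuzzyString('%s')" % self._string


a = FuzzyString("This is a line.")
print(a == -1)                            # False, and no AttributeError
print(a == FuzzyString("This is a ljne."))
```

No explicit `__ne__` is needed here: in Python 3 the default `__ne__` inverts `__eq__` and handles `NotImplemented` correctly.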
A little background: we used this kind of code to align lines of OCR output with lines of ground truth text. This is a different use case from that of dinglehopper (there we align characters, not lines).
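To make the line-alignment use case concrete, here is a small pure-Python sketch of an edit distance over sequences of lines, where two lines count as a match if they are similar enough. It is not how rapidfuzz computes the distance; the helper names `similar` and `line_distance`, the `difflib`-based similarity, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.9):
    """Hypothetical match predicate: lines match if their ratio is high enough."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def line_distance(seq1, seq2, match=similar):
    """Levenshtein distance over two sequences using a custom match predicate."""
    # prev[j] holds the distance between the processed prefix of seq1 and seq2[:j]
    prev = list(range(len(seq2) + 1))
    for i, x in enumerate(seq1, start=1):
        cur = [i]
        for j, y in enumerate(seq2, start=1):
            cost = 0 if match(x, y) else 1
            cur.append(min(prev[j] + 1,          # delete x
                           cur[j - 1] + 1,       # insert y
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1]


ocr = ["This is a line.", "This is another", "And the last line"]
gt = ["This is a ljne.", "This is another", "J u n k", "And the last line"]

print(line_distance(ocr, gt))                             # 1: only "J u n k" is extra
print(line_distance(ocr, gt, match=lambda a, b: a == b))  # 2 with exact matching
```

Passing the predicate explicitly avoids overloading `__eq__` and `__hash__`, at the cost of running in plain Python rather than in rapidfuzz's compiled code.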
I noticed the example code makes no sense as such, but the bug is still there, I think.
2.5.0 FAIL
2.4.3 FAIL
2.3.0 FAIL
2.2.0 FAIL
2.1.4 FAIL
2.0.15 OK
Thanks for reporting. Fixed this in 3b6f22f.
Fantastic and thanks for the rapid bug fixing :)