rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

rapidfuzz not handling custom classes of sequence elements anymore

mikegerber opened this issue

When running this (using a custom class of hashable sequence elements):

from rapidfuzz.distance import Levenshtein


class FuzzyString:
    """
    Example string class, that matches equal if another FuzzyString is
    almost equal
    """

    def __init__(self, string):
        self._string = string

    def __eq__(self, other):
        print(self, other)
        # Just an example!
        min_len = min(len(self._string), len(other._string))
        if min_len > 0:
            normalized_distance = Levenshtein.distance(self._string, other._string) / min_len
            similar = normalized_distance < 0.1
        else:
            similar = False
        return similar

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "FuzzyString('%s')" % self._string

    def __hash__(self):
        return hash(self._string)



# Here we try to compare sequences of lines, where lines are matched if they
# are equal or almost equal.
s1 = [
        FuzzyString("This is a line."),
        FuzzyString("This is another"),
        FuzzyString("And the last line"),
     ]
s2 = [
        FuzzyString("This is a ljne."),
        FuzzyString("This is another"),
        FuzzyString("J  u   n      k"),
        FuzzyString("And the last line"),
     ]


#print(Levenshtein.editops(s1, s2))
print(Levenshtein.distance(s1, s2))

I get the following output and error:

FuzzyString('This is a line.') -1
Traceback (most recent call last):
  File "/home/mike/rapidfuzz-bug-custom-class.py", line 51, in <module>
    print(Levenshtein.distance(s1, s2))
  File "Levenshtein_cpp.pyx", line 115, in rapidfuzz.distance.Levenshtein_cpp.distance
  File "cpp_common.pxd", line 377, in cpp_common.preprocess_strings
  File "cpp_common.pxd", line 331, in cpp_common.conv_sequence
  File "cpp_common.pxd", line 320, in cpp_common.hash_sequence
  File "cpp_common.pxd", line 316, in cpp_common.hash_sequence
  File "cpp_common.pxd", line 245, in cpp_common.rf_hash
  File "/home/mike/rapidfuzz-bug-custom-class.py", line 16, in __eq__
    min_len = min(len(self._string), len(other._string))

rapidfuzz 2.5.0 seems to be comparing (or trying to compare) an object with -1 instead of with another object.
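
For anyone stuck on an affected version, a possible workaround (not the fix that later landed in the library) is to make `__eq__` tolerate foreign operands by returning `NotImplemented`. The sketch below is only an illustration: `difflib.SequenceMatcher` from the standard library stands in for the normalized Levenshtein distance so the snippet has no dependencies, and the 0.9 ratio threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


class FuzzyString:
    """String wrapper that compares equal to another, almost equal FuzzyString."""

    def __init__(self, string):
        self._string = string

    def __eq__(self, other):
        # Guard: a library may compare elements against sentinels such as -1.
        # Returning NotImplemented lets Python fall back to its default
        # comparison instead of raising AttributeError on other._string.
        if not isinstance(other, FuzzyString):
            return NotImplemented
        if not self._string or not other._string:
            return False
        # Stand-in similarity measure (assumption): difflib ratio, not Levenshtein.
        ratio = SequenceMatcher(None, self._string, other._string).ratio()
        return ratio > 0.9

    def __hash__(self):
        # Same caveat as in the report: fuzzy-equal objects may hash differently.
        return hash(self._string)

    def __repr__(self):
        return "FuzzyString(%r)" % self._string


print(FuzzyString("This is a line.") == -1)
print(FuzzyString("This is a line.") == FuzzyString("This is a ljne."))
```

With the guard in place, comparing against an integer no longer reaches `other._string`.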

A little background: We used this kind of code to align lines of OCR output with lines of ground truth text. This is a different use case from that of dinglehopper, where we align characters, not lines.
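
To make the line-alignment use case concrete without relying on a custom `__eq__`, here is a rough, dependency-free sketch that passes the "almost equal" rule as an explicit predicate. The names `edit_distance` and `lines_match` are hypothetical helpers, not rapidfuzz API, and the plain dynamic-programming loop is far slower than the library's implementation.

```python
def edit_distance(a, b, match=lambda x, y: x == y):
    """Levenshtein distance between two sequences using a custom match predicate."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if match(x, y) else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def lines_match(x, y, threshold=0.1):
    """Treat two lines as matching if their normalized distance is below threshold."""
    min_len = min(len(x), len(y))
    if min_len == 0:
        return False
    return edit_distance(x, y) / min_len < threshold


ground_truth = ["This is a line.", "This is another", "And the last line"]
ocr = ["This is a ljne.", "This is another", "J  u   n      k", "And the last line"]

# Only the inserted junk line should count as an edit.
print(edit_distance(ground_truth, ocr, match=lines_match))
```

Passing the predicate explicitly keeps hashing out of the picture, which avoids the inconsistency between a fuzzy `__eq__` and an exact `__hash__`.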

I noticed the example code makes no sense as such, but the bug is still there, I think.

2.5.0 FAIL
2.4.3 FAIL
2.3.0 FAIL
2.2.0 FAIL
2.1.4 FAIL
2.0.15 OK

Thanks for reporting. Fixed this in 3b6f22f.

Fantastic and thanks for the rapid bug fixing :)