rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

BUG: `None` can't work with `process.cdist`

Zeroto521 opened this issue · comments

Passing None to process.cdist raises a TypeError, even though fuzz.ratio accepts None.

>>> from rapidfuzz import process
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", None],
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Software\miniforge3\envs\dtoolkit\lib\site-packages\rapidfuzz\process_cpp.py", line 73, in cdist
    _cdist(
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1508, in rapidfuzz.process_cpp_impl.cdist
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1393, in rapidfuzz.process_cpp_impl.cdist_two_lists
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1321, in rapidfuzz.process_cpp_impl.preprocess
  File "./src/rapidfuzz/cpp_common.pxd", line 332, in cpp_common.conv_sequence
  File "./src/rapidfuzz/cpp_common.pxd", line 300, in cpp_common.hash_sequence
TypeError: object of type 'NoneType' has no len()

The core problem is that process.cdist does not work with a wrapped scorer function.

>>> from rapidfuzz import fuzz, process
>>> from functools import wraps
>>> import pandas as pd

>>> def check_nan(func):
...     @wraps(func)
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
...
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Software\miniforge3\envs\dtoolkit\lib\site-packages\rapidfuzz\process_cpp.py", line 73, in cdist
    _cdist(
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1508, in rapidfuzz.process_cpp_impl.cdist
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1393, in rapidfuzz.process_cpp_impl.cdist_two_lists
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1321, in rapidfuzz.process_cpp_impl.preprocess
  File "./src/rapidfuzz/cpp_common.pxd", line 332, in cpp_common.conv_sequence
  File "./src/rapidfuzz/cpp_common.pxd", line 300, in cpp_common.hash_sequence
TypeError: object of type 'float' has no len()

Hm, true. There are a couple of things that could be changed:

  1. None should probably be handled for the fuzz module when it is used with process.cdist. The core problem here is that for many scorers, such as Levenshtein.distance, it is completely unclear what the score should be for None. So this requires special handling for the functions where it is possible.
  2. The handling of pure Python functions in process.cdist is certainly a bug that should be fixed. Note, however, that using a pure Python function with process.cdist removes basically all of the performance benefits it has for the integrated scorers.

I store some attributes in the __dict__ of scorers, and these must not be copied to wrapped functions. For this reason you would need to disable the __dict__ copy that functools.wraps performs by default:

>>> def check_nan(func):
...     @wraps(func, updated=())
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
... 
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
array([[28.571428,  0.      ],
       [ 0.      ,  0.      ]], dtype=float32)

There is not really anything I can do about this breakage.
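
The effect of `updated=()` can be reproduced with the standard library alone. The following is a minimal sketch, not RapidFuzz code: the toy `scorer` function and its `_fast_path` attribute are hypothetical stand-ins for whatever metadata a library keeps on its scorers.

```python
from functools import wraps


def scorer(a, b):
    """Toy scorer: 100 if the inputs are equal, otherwise 0."""
    return 100.0 if a == b else 0.0


# Hypothetical metadata attribute, stored in scorer.__dict__.
scorer._fast_path = "native implementation of scorer"


def wrap_default(func):
    # Default behaviour: updated=("__dict__",), so attributes are copied.
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def wrap_no_dict(func):
    # Keeps __name__, __doc__ and friends, but skips the __dict__ copy.
    @wraps(func, updated=())
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


copied = wrap_default(scorer)
clean = wrap_no_dict(scorer)

print(hasattr(copied, "_fast_path"))  # the wrapper inherited the metadata
print(hasattr(clean, "_fast_path"))   # the wrapper did not inherit it
print(clean.__name__)                 # the name is still taken from scorer
```

With the default settings the wrapper carries metadata that describes a different function, which is how a caller inspecting those attributes can be misled into bypassing the wrapper.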

In fact, I was able to get this to work. I now check whether the function was wrapped by also storing the original function pointer in the function's attributes. So this works now:

>>> from rapidfuzz import fuzz, process
>>> from functools import wraps
>>> import pandas as pd
>>> 
>>> def check_nan(func):
...     @wraps(func)
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
... 
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
array([[28.571428,  0.      ],
       [ 0.      ,  0.      ]], dtype=float32)
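
The idea of validating the stored function pointer can be sketched roughly as follows. This is only an illustration of the approach described above, not the actual RapidFuzz implementation: the helper names `attach_fast_path` and `lookup_fast_path` and the `_scorer_info` attribute are hypothetical.

```python
from functools import wraps


def attach_fast_path(func, fast_path):
    # Hypothetical helper: store the fast path together with its owner.
    func._scorer_info = {"owner": func, "fast_path": fast_path}
    return func


def lookup_fast_path(func):
    # Only trust the metadata if it was attached to this exact function.
    info = getattr(func, "_scorer_info", None)
    if info is not None and info["owner"] is func:
        return info["fast_path"]
    return None  # wrapped or unknown: the caller should invoke func directly


def ratio(a, b):
    """Toy scorer standing in for a library scorer."""
    return 100.0 if a == b else 0.0


attach_fast_path(ratio, "native")


@wraps(ratio)  # copies ratio.__dict__, including _scorer_info
def wrapped(a, b):
    return 0.0 if a is None or b is None else ratio(a, b)


print(lookup_fast_path(ratio))    # metadata belongs to ratio
print(lookup_fast_path(wrapped))  # metadata was copied, owner differs
```

Because the copied metadata still names the original function as its owner, the identity check fails for the wrapper and the wrapper itself gets called, so a plain `@wraps(func)` is enough.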

So the only remaining issue is support for None/nan in process.cdist in cases where the corresponding scorer supports it.

I added support for None and float("nan") to process.cdist for scorers which support it (all normalized scorers):

>>> process.cdist(["example1", "example2"], ["example2", None])
array([[ 87.5,   0. ],
       [100. ,   0. ]], dtype=float32)
>>> process.cdist(["example1", "example2"], ["example2", float("nan")])
array([[ 87.5,   0. ],
       [100. ,   0. ]], dtype=float32)
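
For readers who want to see the rule spelled out, here is a pure-Python sketch of the behaviour shown above: any pair that contains None or NaN scores 0. It is an assumption-laden illustration, not how RapidFuzz implements it. The names `is_missing`, `similarity`, and `cdist_reference` are hypothetical, and `difflib.SequenceMatcher` is used as a stand-in normalized scorer rather than `fuzz.ratio`.

```python
import math
from difflib import SequenceMatcher


def is_missing(value):
    """Return True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def similarity(a, b):
    """Stand-in normalized scorer in [0, 100]; not RapidFuzz's fuzz.ratio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def cdist_reference(queries, choices, scorer=similarity):
    """Score every query against every choice; missing values score 0."""
    return [
        [0.0 if is_missing(q) or is_missing(c) else scorer(q, c) for c in choices]
        for q in queries
    ]


matrix = cdist_reference(["example1", "example2"], ["example2", float("nan")])
for row in matrix:
    print(row)
```

A normalized similarity has a natural "no match" value of 0, which is why this rule is easy to define here. For an unbounded distance such as Levenshtein.distance there is no comparably obvious value for a missing input.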