BUG: `None` can't work with `process.cdist`

Zeroto521 opened this issue 2 years ago · comments

process.cdist raises a TypeError when one of the input sequences contains None.
fuzz.ratio, however, accepts None.

>>> from rapidfuzz import process
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", None],
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Software\miniforge3\envs\dtoolkit\lib\site-packages\rapidfuzz\process_cpp.py", line 73, in cdist
    _cdist(
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1508, in rapidfuzz.process_cpp_impl.cdist
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1393, in rapidfuzz.process_cpp_impl.cdist_two_lists
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1321, in rapidfuzz.process_cpp_impl.preprocess
  File "./src/rapidfuzz/cpp_common.pxd", line 332, in cpp_common.conv_sequence
  File "./src/rapidfuzz/cpp_common.pxd", line 300, in cpp_common.hash_sequence
TypeError: object of type 'NoneType' has no len()

40% · Answer 1 · Sun Nov 27 2022 17:19:12 GMT+0800 (China Standard Time)

The core problem is that process.cdist does not work with a wrapped scorer. Even a wrapper that handles missing values itself fails with the same error:

>>> from rapidfuzz import fuzz, process
>>> from functools import wraps
>>> import pandas as pd

>>> def check_nan(func):
...     @wraps(func)
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
...
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Software\miniforge3\envs\dtoolkit\lib\site-packages\rapidfuzz\process_cpp.py", line 73, in cdist
    _cdist(
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1508, in rapidfuzz.process_cpp_impl.cdist
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1393, in rapidfuzz.process_cpp_impl.cdist_two_lists
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1321, in rapidfuzz.process_cpp_impl.preprocess
  File "./src/rapidfuzz/cpp_common.pxd", line 332, in cpp_common.conv_sequence
  File "./src/rapidfuzz/cpp_common.pxd", line 300, in cpp_common.hash_sequence
TypeError: object of type 'float' has no len()

Max Bachmann · Answer 2 · Sun Nov 27 2022 19:45:45 GMT+0800 (China Standard Time)

Hm, true, there are a couple of things that could be changed:

- None should probably be handled for the fuzz module when used with process.cdist. The core problem is that for many scorers, such as Levenshtein.distance, it is completely unclear what the score for None should be. So this requires special handling for the functions where a sensible score exists.
- The handling of pure Python functions in process.cdist is certainly a bug that should be fixed. Note, however, that using a pure Python function with process.cdist basically removes all the performance benefits it has for the integrated scorers.
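
To illustrate the first point, here is a minimal sketch using only the standard library. It assumes difflib.SequenceMatcher as a stand-in for a normalized similarity; the helper names is_missing, similarity and pairwise are hypothetical and are not part of rapidfuzz. For a similarity bounded to [0, 100], a score of 0 is a natural answer for a missing value, whereas an unbounded distance has no such obvious value.

```python
import math
from difflib import SequenceMatcher


def is_missing(value):
    """True for None and float('nan'); hypothetical helper, not a rapidfuzz API."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def similarity(a, b):
    """Stand-in normalized similarity in [0, 100]; NOT rapidfuzz's fuzz.ratio."""
    if is_missing(a) or is_missing(b):
        # A bounded similarity has a natural "no match" value.
        return 0.0
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def pairwise(queries, choices, scorer):
    """Pure-Python score matrix: one row per query, one column per choice."""
    return [[scorer(q, c) for c in choices] for q in queries]


matrix = pairwise(["hello", "world"], ["hi", None], similarity)
for row in matrix:
    print(row)
```

A loop like pairwise calls the scorer once per pair from Python, which is the kind of overhead the second point warns about.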

Max Bachmann · Answer 3 · Sun Nov 27 2022 20:33:29 GMT+0800 (China Standard Time)

I store some things inside the __dict__ of scorers, which should not be copied to wrapped functions. For this reason you would need to disable the __dict__ copy by passing updated=() to wraps:

>>> def check_nan(func):
...     @wraps(func, updated=())
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
... 
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
array([[28.571428,  0.      ],
       [ 0.      ,  0.      ]], dtype=float32)

There is not really anything I can do about this breakage.
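
The effect of updated=() can be seen with the standard library alone. In this sketch the _fast_path attribute is hypothetical and only stands in for whatever a library might keep in a scorer's __dict__; it is not an actual rapidfuzz attribute.

```python
from functools import wraps


def scorer(a, b):
    """Toy scorer standing in for a library scorer."""
    return float(a == b)


# Hypothetical metadata stored in the function's __dict__.
scorer._fast_path = "native-implementation"


def passthrough(func, **wraps_kwargs):
    """Wrap func unchanged, forwarding extra keyword arguments to wraps."""
    @wraps(func, **wraps_kwargs)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


copied = passthrough(scorer)             # default: updated=("__dict__",)
clean = passthrough(scorer, updated=())  # skip the __dict__ update

print(hasattr(copied, "_fast_path"))  # True: metadata leaked onto the wrapper
print(hasattr(clean, "_fast_path"))   # False: __dict__ was not copied
print(clean.__name__)                 # scorer: __name__ is still assigned
```

By default wraps updates the wrapper's __dict__ from the wrapped function, so any metadata stored there ends up on the wrapper even though it describes a different callable.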

Max Bachmann · Answer 4 · Sat Dec 10 2022 01:24:08 GMT+0800 (China Standard Time)

In fact, I was able to get this to work. I now validate whether the function was wrapped by also storing the original function pointer inside the function's attributes. So this works now:

>>> from rapidfuzz import fuzz, process
>>> from functools import wraps
>>> import pandas as pd
>>> 
>>> def check_nan(func):
...     @wraps(func)
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
... 
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
array([[28.571428,  0.      ],
       [ 0.      ,  0.      ]], dtype=float32)

So the only remaining issue is support for None/nan in process.cdist in cases where the corresponding scorer supports it.
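
The wrapper check described above can be sketched roughly as follows. This is only an illustration of the idea, not rapidfuzz's actual implementation: the attribute names _meta and _owner and the helpers register and usable_meta are hypothetical.

```python
from functools import wraps


def register(func):
    """Attach hypothetical metadata plus a reference to the function itself."""
    func._meta = {"fast_path": True}  # hypothetical attribute name
    func._owner = func                # hypothetical attribute name
    return func


def usable_meta(func):
    """Return the metadata only if it was attached to this exact function."""
    if getattr(func, "_owner", None) is func:
        return func._meta
    # Either no metadata at all, or it was copied from another function.
    return None


@register
def scorer(a, b):
    """Toy scorer standing in for a library scorer."""
    return float(a == b)


@wraps(scorer)  # default wraps: copies scorer.__dict__ onto the wrapper
def wrapped(a, b):
    return 0.0 if a is None or b is None else scorer(a, b)


print(usable_meta(scorer))   # {'fast_path': True}
print(usable_meta(wrapped))  # None: _owner still points at scorer
```

Because the copied _owner still refers to the original function, an identity check is enough to tell that the metadata does not belong to the wrapper, which can then be treated as an ordinary Python callable.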

Max Bachmann · Answer 5 · Mon Apr 17 2023 04:39:30 GMT+0800 (China Standard Time)

I added support for None and float("nan") to process.cdist for scorers which support it (all normalized scorers):

>>> process.cdist(["example1", "example2"], ["example2", None])
array([[ 87.5,   0. ],
       [100. ,   0. ]], dtype=float32)
>>> process.cdist(["example1", "example2"], ["example2", float("nan")])
array([[ 87.5,   0. ],
       [100. ,   0. ]], dtype=float32)