BUG: `None` can't work with `process.cdist`
Zeroto521 opened this issue · comments
`None` fails in `process.cdist`, but `fuzz.ratio` accepts `None` without error.
>>> from rapidfuzz import process
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", None],
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Software\miniforge3\envs\dtoolkit\lib\site-packages\rapidfuzz\process_cpp.py", line 73, in cdist
    _cdist(
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1508, in rapidfuzz.process_cpp_impl.cdist
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1393, in rapidfuzz.process_cpp_impl.cdist_two_lists
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1321, in rapidfuzz.process_cpp_impl.preprocess
  File "./src/rapidfuzz/cpp_common.pxd", line 332, in cpp_common.conv_sequence
  File "./src/rapidfuzz/cpp_common.pxd", line 300, in cpp_common.hash_sequence
TypeError: object of type 'NoneType' has no len()
The core problem is that `process.cdist` does not work with a wrapped scorer, so the missing values cannot be filtered out in a decorator either:
>>> from rapidfuzz import fuzz
>>> from functools import wraps
>>> import pandas as pd
>>> def check_nan(func):
...     @wraps(func)
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
...
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Software\miniforge3\envs\dtoolkit\lib\site-packages\rapidfuzz\process_cpp.py", line 73, in cdist
    _cdist(
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1508, in rapidfuzz.process_cpp_impl.cdist
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1393, in rapidfuzz.process_cpp_impl.cdist_two_lists
  File "src/rapidfuzz/process_cpp_impl.pyx", line 1321, in rapidfuzz.process_cpp_impl.preprocess
  File "./src/rapidfuzz/cpp_common.pxd", line 332, in cpp_common.conv_sequence
  File "./src/rapidfuzz/cpp_common.pxd", line 300, in cpp_common.hash_sequence
TypeError: object of type 'float' has no len()
Hm, true. There are a couple of things that could be changed:
- `None` should probably be handled for the `fuzz` module when used with `process.cdist`. The core problem here is that for a lot of scorers, like `Levenshtein.distance`, it is completely unclear what the score should be for `None`. So this requires special handling for the functions where it is possible.
- The handling of pure Python functions in `process.cdist` is certainly a bug that should be fixed. Note, however, that using a pure Python function with `process.cdist` will basically remove all the performance benefits it has for the integrated scorers.
I store some things inside the `__dict__` of scorers, which should not be copied for wrapped functions. For this reason you would need to disable the dict copy:
>>> def check_nan(func):
...     @wraps(func, updated=())
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
...
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
array([[28.571428,  0.      ],
       [ 0.      ,  0.      ]], dtype=float32)
There is not really anything I can do about this breakage.
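To make the effect of `updated=()` concrete, here is a small standard-library sketch. It does not use rapidfuzz at all: `scorer` is a toy function and `_metadata` is a hypothetical attribute standing in for whatever the library keeps in the scorer's `__dict__`.

```python
from functools import wraps


def scorer(a, b):
    """Toy scorer standing in for a library scorer (hypothetical)."""
    return 100.0 if a == b else 0.0


# Hypothetical attribute; stands in for data a library might attach to a scorer.
scorer._metadata = {"native": True}


def wrap_default(func):
    # Default behaviour: updated=("__dict__",), so func.__dict__ is merged
    # into the wrapper's __dict__.
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def wrap_no_dict(func):
    # updated=() skips the __dict__ merge; __name__, __doc__ etc. are still copied.
    @wraps(func, updated=())
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


copied = wrap_default(scorer)
clean = wrap_no_dict(scorer)

print(hasattr(copied, "_metadata"))  # the attribute leaked onto the wrapper
print(hasattr(clean, "_metadata"))   # the attribute was not copied
print(clean.__name__)                # name is still taken from the original
```

With the default arguments the wrapper inherits the attribute and so looks like the original scorer to any code that inspects it, even though it is a plain Python function.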
In fact, I was able to get this to work. I now validate whether the function was wrapped by also storing the original function pointer inside the function's attributes. So this works now:
>>> from rapidfuzz import fuzz, process
>>> from functools import wraps
>>> import pandas as pd
>>>
>>> def check_nan(func):
...     @wraps(func)
...     def decorator(*args, **kwargs):
...         return 0 if pd.isna(args[0]) or pd.isna(args[1]) else func(*args, **kwargs)
...     return decorator
...
>>> process.cdist(
...     ["hello", "world"],
...     ["hi", float("nan")],
...     scorer=check_nan(fuzz.ratio),
... )
array([[28.571428,  0.      ],
       [ 0.      ,  0.      ]], dtype=float32)
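The check described above (keeping a reference to the original function next to the stored attributes) could look roughly like the following. This is only an illustration of the idea in plain Python: `register`, `native_info` and `_scorer_info` are hypothetical names, not rapidfuzz's actual internals.

```python
from functools import wraps


def register(func):
    """Hypothetical: attach metadata plus a reference to the function itself."""
    func._scorer_info = {"native": True, "owner": func}
    return func


def native_info(func):
    """Return the metadata only if it was stored for this exact function object."""
    info = getattr(func, "_scorer_info", None)
    if info is not None and info["owner"] is func:
        return info
    # Either no metadata at all, or metadata that functools.wraps copied
    # onto a wrapper: treat the callable as a generic Python function.
    return None


@register
def ratio(a, b):
    """Toy scorer used only for this sketch."""
    return 100.0 if a == b else 0.0


def check_none(func):
    @wraps(func)  # default arguments, so _scorer_info is copied to the wrapper
    def wrapper(a, b):
        return 0.0 if a is None or b is None else func(a, b)
    return wrapper


wrapped = check_none(ratio)

print(native_info(ratio) is not None)  # original function is recognized
print(native_info(wrapped) is None)    # wrapper is detected despite the copied attribute
```

The identity comparison is what makes a plain `@wraps(func)` safe here: the copied metadata still points at `ratio`, so the wrapper is not mistaken for it.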
So the only remaining issue that needs to be handled is support for `None`/`nan` in `process.cdist` in cases where it is supported by the corresponding scorer.
I added support for `None` and `float("nan")` to `process.cdist` for scorers which support it (all normalized scorers):
>>> process.cdist(["example1", "example2"], ["example2", None])
array([[ 87.5,   0. ],
       [100. ,   0. ]], dtype=float32)
>>> process.cdist(["example1", "example2"], ["example2", float("nan")])
array([[ 87.5,   0. ],
       [100. ,   0. ]], dtype=float32)
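The behaviour shown in these outputs, where any pair containing `None` or `nan` scores 0, can be modelled in plain Python as follows. This is a sketch of the semantics only, not how `process.cdist` is implemented: `is_missing`, `pairwise_scores` and the toy `exact_match` scorer are hypothetical helpers, and no pandas or rapidfuzz is needed.

```python
import math


def is_missing(value):
    """True for None and for float NaN; strings and other values are kept."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pairwise_scores(queries, choices, scorer, missing_score=0.0):
    """Score every query against every choice, short-circuiting missing values."""
    return [
        [
            missing_score if is_missing(q) or is_missing(c) else scorer(q, c)
            for c in choices
        ]
        for q in queries
    ]


def exact_match(a, b):
    """Toy similarity scorer (hypothetical): 100 for equal strings, else 0."""
    return 100.0 if a == b else 0.0


matrix = pairwise_scores(["example1", "example2"], ["example2", None], exact_match)
for row in matrix:
    print(row)
```

A fixed score of 0 only makes sense for similarity-style scorers; for a distance such as `Levenshtein.distance` there is no obvious value to return, which is the ambiguity raised earlier in the thread.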