rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

BUG: Can't handle `nan`

Zeroto521 opened this issue and commented:
In [1]: from rapidfuzz import fuzz

In [2]: fuzz.ratio("this is a test", float("nan"))  # same with `np.nan`
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 fuzz.ratio("this is a test", float("nan"))
File src/rapidfuzz/fuzz_cpp.pyx:72, in rapidfuzz.fuzz_cpp.ratio()
File ./src/rapidfuzz/cpp_common.pxd:379, in cpp_common.preprocess_strings()
File ./src/rapidfuzz/cpp_common.pxd:332, in cpp_common.conv_sequence()
File ./src/rapidfuzz/cpp_common.pxd:300, in cpp_common.hash_sequence()
TypeError: object of type 'float' has no len()

Similar to seatgeek/thefuzz#41.

Since this was never supported, I do not think this is really a bug. However, I think it would be a reasonable extension to handle float("nan") the same way as None.

rapidfuzz.fuzz.*, rapidfuzz.distance.*.normalized_distance and rapidfuzz.distance.*.normalized_similarity are now able to handle both None and nan. Other scorers do not support them, since it is unclear what the result in this case should be. This is supported by rapidfuzz.process.* as well. The only exception is rapidfuzz.process.cdist as described in #293.
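To illustrate the idea of treating nan like None, here is a minimal sketch in plain Python. It is not RapidFuzz's implementation: `is_missing` and `safe_ratio` are hypothetical names, `difflib.SequenceMatcher` stands in for the real scorer, and returning a score of 0 for missing input is an assumption made for this example.

```python
import math
from difflib import SequenceMatcher


def is_missing(value):
    """Hypothetical helper: True for None and for float NaN (np.nan is a float)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def safe_ratio(s1, s2, missing_score=0.0):
    """Similarity in [0, 100]; missing input short-circuits to missing_score.

    Assumption: a missing value scores 0. The stand-in scorer is difflib,
    not rapidfuzz.fuzz.ratio.
    """
    if is_missing(s1) or is_missing(s2):
        return missing_score
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


# No TypeError is raised for NaN or None input.
print(safe_ratio("this is a test", float("nan")))
print(safe_ratio("this is a test", None))
print(safe_ratio("this is a test", "this is a test"))
```

The check happens before any call to `len()` or hashing, which is where the traceback above fails for a float.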

commented

> Other scorers do not support them, since it is unclear what the result in this case should be.

In these cases, the scorers could return None or nan.

  • None or nan means the result is missing or unknown.
  • It would also avoid breaking the calculation, instead of raising an error.
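A rough sketch of what this proposal could look like for a non-normalized distance, assuming nan is the chosen "missing" result. The names `levenshtein` and `nan_distance` are hypothetical and the edit distance is a plain-Python stand-in, not the RapidFuzz scorer.

```python
import math


def is_missing(value):
    """Hypothetical helper: True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def levenshtein(s1, s2):
    """Plain dynamic-programming edit distance (stand-in scorer)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution
            ))
        previous = current
    return previous[-1]


def nan_distance(s1, s2):
    """Return NaN instead of raising when either input is missing."""
    if is_missing(s1) or is_missing(s2):
        return float("nan")
    return levenshtein(s1, s2)


pairs = [("kitten", "sitting"), ("this is a test", float("nan"))]
distances = [nan_distance(a, b) for a, b in pairs]

# The loop finishes without a TypeError; the NaN marks the unknown result
# and propagates through later arithmetic such as sum().
print(distances)
print(math.isnan(sum(distances)))
```

One trade-off with this design is that callers have to check for nan explicitly, since nan compares unequal to everything, including itself.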