rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Could process.cdist store JaroWinkler similarity (and other similarities) as a uint8 on a scale of 1-100

lcubeddu opened this issue

Hello,

process.cdist can produce gigantic matrices when the results are stored as float32. For my use case, I end up with an 18GB matrix. The ability to store the results as uint8 is greatly appreciated, and it works perfectly with Levenshtein similarity, which is on a 0-100 scale.
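To put the memory argument in numbers, here is a rough sketch of how the dtype changes the size of a dense result matrix. The dimensions are hypothetical, chosen only so that the float32 case lands at 18 GB; the actual shape of the matrix in this issue is not stated.

```python
import numpy as np


def matrix_bytes(n_queries: int, n_choices: int, dtype) -> int:
    """Size in bytes of a dense n_queries x n_choices matrix of the given dtype."""
    return n_queries * n_choices * np.dtype(dtype).itemsize


# Hypothetical dimensions, for illustration only.
n_queries, n_choices = 50_000, 90_000

float32_gb = matrix_bytes(n_queries, n_choices, np.float32) / 1e9
uint8_gb = matrix_bytes(n_queries, n_choices, np.uint8) / 1e9
print(f"float32: {float32_gb:.1f} GB, uint8: {uint8_gb:.1f} GB")
```

A uint8 element takes one byte instead of four, so the same matrix needs a quarter of the memory.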

However, distance.JaroWinkler.similarity and other similarity functions return scores on a 0-1 scale, which means that the uint8 result is currently rounded to 0 or to 1.
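The loss of information can be reproduced with NumPy alone. This is only a sketch with made-up scores: NumPy's cast truncates toward zero, which may differ from the exact conversion process.cdist performs, but either way a 0-1 score can only end up as 0 or 1 in a uint8.

```python
import numpy as np

# Made-up similarity scores on a 0-1 scale, for illustration only.
scores = np.array([0.0, 0.42, 0.87, 0.999, 1.0])

# A direct cast keeps only the integer part, so almost everything becomes 0.
as_uint8 = scores.astype(np.uint8)

# Scaling to 0-100 before the cast preserves two digits of precision.
scaled = np.round(scores * 100).astype(np.uint8)

print(as_uint8)
print(scaled)
```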

Could we have native C++ functions that return scores on different scales?

I think the best way would be to add a score_multiplier argument to process.cdist which simply multiplies the results of the scorer before storing them in the result matrix, so it could be used e.g. as:

process.cdist(..., dtype=np.uint8, score_multiplier=255)

27f178b adds a score_multiplier argument to process.cdist which allows this kind of usage.