Could process.cdist store JaroWinkler similarity (and other similarities) as a uint8 on a scale of 1-100

Question

lcubeddu opened this issue a year ago

Hello,

process.cdist can produce gigantic matrices if the results are stored as float32. For my use case, I end up with an 18GB matrix. The ability to store the results as uint8 is greatly appreciated, and works perfectly with Levenshtein similarity, which is on a 0-100 scale.

However, distance.JaroWinkler.similarity and other similarity functions return values on a 0-1 scale, which means that with a uint8 dtype every score is currently rounded to either 0 or 1.

Could we have native C++ functions that return scores on different scales?
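The loss of resolution described in the question can be reproduced with NumPy alone. This is a minimal sketch, not what process.cdist does internally: the score values are made up, and round-to-nearest before the cast is an assumption based on the question's wording.

```python
import numpy as np

# Hypothetical similarity scores on the 0-1 scale used by JaroWinkler.similarity.
scores = np.array([0.0, 0.4, 0.93, 1.0])

# Assumption: scores are rounded to the nearest integer before the cast,
# so a 0-1 scale collapses to the two values 0 and 1.
coarse = np.rint(scores).astype(np.uint8)

# Scaling to 0-100 before rounding keeps two decimal digits of resolution.
scaled = np.rint(scores * 100).astype(np.uint8)

print(coarse.tolist())  # [0, 0, 1, 1]
print(scaled.tolist())  # [0, 40, 93, 100]
```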

Max Bachmann · Answer 1 · Tue Jul 04 2023 19:30:24 GMT+0800 (China Standard Time)

I think the best way would be to add a score_multiplier argument to process.cdist which simply multiplies the results of the scorer before storing them in the result matrix, so it could be used e.g. as:

process.cdist(..., dtype=np.uint8, score_multiplier=255)

Max Bachmann · Answer 2 · Sun Oct 22 2023 04:25:55 GMT+0800 (China Standard Time)

27f178b adds a score_multiplier argument to process.cdist which allows this kind of usage.