Could process.cdist store JaroWinkler similarity (and other similarities) as a uint8 on a scale of 0-100?
lcubeddu opened this issue
Hello,
process.cdist can produce gigantic matrices when the results are stored as float32. For my use case, I end up with an 18GB matrix. The ability to store results as uint8 is greatly appreciated, and it works perfectly with Levenshtein similarity, which is on a 0-100 scale.
However, distance.JaroWinkler.similarity and other similarity functions return scores on a 0-1 scale, which means that every uint8 value currently ends up rounded to either 0 or 1.
Could we have native C++ functions that return scores on different scales?
I think the best way would be to add a score_multiplier argument to process.cdist which simply multiplies the results of the scorer before storing them in the result matrix, so it could be used e.g. as:
process.cdist(..., dtype=np.uint8, score_multiplier=255)
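To illustrate the idea, here is a minimal standard-library sketch of what scaling before storage buys you. It is not the library's implementation: store_as_uint8 is a hypothetical helper, array("B") stands in for a np.uint8 result matrix, the sample scores are made up, and the rounding rule (Python's round) is an assumption.

```python
from array import array


def store_as_uint8(scores, score_multiplier=1):
    """Scale each score, round it, and store it in one unsigned byte.

    Hypothetical helper for illustration only; not part of any library.
    """
    # array("B") holds unsigned 8-bit values (0-255) and raises
    # OverflowError if a scaled score falls outside that range.
    return array("B", (round(s * score_multiplier) for s in scores))


# Made-up similarity scores on a 0-1 scale, like those a scorer such as
# JaroWinkler similarity returns.
scores = [0.0, 0.42, 0.87, 1.0]

collapsed = store_as_uint8(scores)                      # only 0 or 1 survive
percent = store_as_uint8(scores, score_multiplier=100)  # 0-100 scale
full = store_as_uint8(scores, score_multiplier=255)     # full uint8 range

print(list(collapsed), list(percent), list(full))
```

A multiplier of 100 gives a familiar percentage scale, while 255 uses the whole uint8 range and so keeps the most resolution. Either way each entry takes 1 byte instead of the 4 bytes of a float32, so a matrix that needs 18GB as float32 would need roughly 4.5GB.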
27f178b adds a score_multiplier argument to process.cdist which allows this kind of usage.