rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/


Is there an alias feature?

mkandulavm opened this issue · comments

Hi

Is there a way to provide alias options?
For example, "street", "st", and "road" could be aliases in some scenarios.

How can this be done? Thank you.

So far there is no way to alias characters/words. There is already a request for character-dependent weights: #241.
This would allow you to alias individual elements by setting their substitution cost to 0. It would still only work on individual symbols:

Levenshtein.distance(["street", "road"], ["st", "st"]) # result is 2
weights=...
weights["street", "st"] = 0
weights["st", "street"] = 0
weights["road", "st"] = 0
weights["st", "road"] = 0
weights["street", "road"] = 0
weights["road", "street"] = 0
Levenshtein.distance(["street", "road"], ["st", "st"], weights=weights) # result is 0

which might be enough for your use case.
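To make the idea concrete, here is a minimal pure-Python sketch of a word-level Levenshtein distance with zero-cost alias substitutions, mimicking what the proposed weights from #241 would compute. The function name and signature are made up for illustration; this is not part of the RapidFuzz API.

```python
def weighted_levenshtein(a, b, zero_cost_pairs=frozenset()):
    """Levenshtein distance over token lists; substituting any pair listed
    in zero_cost_pairs (as (token_a, token_b) tuples) costs nothing."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances for the previous row
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1] or (a[i - 1], b[j - 1]) in zero_cost_pairs:
                sub = prev[j - 1]          # exact match or aliased pair: free
            else:
                sub = prev[j - 1] + 1      # substitution
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)  # vs. deletion / insertion
        prev = cur
    return prev[n]

# Alias table equivalent to the weights sketched above
aliases = {("street", "st"), ("st", "street"),
           ("road", "st"), ("st", "road"),
           ("street", "road"), ("road", "street")}

print(weighted_levenshtein(["street", "road"], ["st", "st"]))           # 2
print(weighted_levenshtein(["street", "road"], ["st", "st"], aliases))  # 0
```

This reproduces the two results from the sketch above: distance 2 without aliases, 0 with them.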

This is exactly what I need!
But is an equivalent call exposed in C++?

Also, which scorer is best for such scenarios (since tokens can be presented without order)?

But is an equivalent call exposed in C++?

So far this feature does not exist in either of them. However, it will definitely be implemented in C++; the Python implementation will only wrap it. It will extend: https://github.com/maxbachmann/rapidfuzz-cpp/blob/d937555ad76a6f1ed853ab4b7102a7b22b6f0fcf/rapidfuzz/distance/Levenshtein.hpp#L142

Also, which scorer is best for such scenarios (since tokens can be presented without order)?

At least right now the feature is only planned for Levenshtein/OSA/DamerauLevenshtein. None of those sort the tokens before comparing them.

You can also preprocess the input strings so that the fuzzy matching runs on a preprocessed string x while the result still carries the pair (x, y), with y being the original string. In that preprocessing step, you replace every word that should score the same with one canonical word in x.

This is heavy on string manipulation, but if you want to use one of the token-sorting scorers, like token_set_ratio or similar, you can do it that way.
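A minimal sketch of that preprocessing step, assuming a hand-written alias table (the names here are illustrative, not RapidFuzz API):

```python
# Example alias table: every alias maps to one canonical word.
CANONICAL = {"st": "street", "road": "street"}

def canonicalize(s):
    """Replace each aliased word in s with its canonical form."""
    return " ".join(CANONICAL.get(word, word) for word in s.lower().split())

choices = ["Baker St", "Baker Road", "Baker Avenue"]
# Keep (canonical, original) pairs so results can report the original string.
prepared = [(canonicalize(c), c) for c in choices]

print(canonicalize("Baker St"))  # "baker street"
print(prepared[0])               # ("baker street", "Baker St")
```

If you go through rapidfuzz.process, a function like `canonicalize` can also be passed via the `processor` argument so that both the query and the choices are canonicalized before scoring.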

The more words you replace (or the more similar tricks you apply, like stripping accents/combining characters), the more likely it becomes that two or more candidates tie for the best score, which can lead to inconsistent results across repeated runs on the same dataset.

If it matters, fetch the 2 or 3 best matches (or keep fetching until the scores differ) and check whether the top scores tie. If they do, either pick the 'winner' by a consistent order or, if both are somehow valid and you can, combine the results.
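A small sketch of that tie-breaking, assuming results arrive as (choice, score) pairs (the helper name is made up for illustration):

```python
def best_with_ties(results):
    """results: iterable of (choice, score) pairs.
    Returns (winner, tied_choices) with a deterministic winner."""
    # Sort by score descending, then alphabetically, so repeated runs
    # over the same data always produce the same order.
    ranked = sorted(results, key=lambda r: (-r[1], r[0]))
    best_score = ranked[0][1]
    tied = [choice for choice, score in ranked if score == best_score]
    return tied[0], tied

results = [("Baker Road", 90.0), ("Baker St", 90.0), ("Baker Ave", 72.0)]
winner, tied = best_with_ties(results)
print(winner)  # "Baker Road" (alphabetically first among the tie)
print(tied)    # ["Baker Road", "Baker St"]
```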

This is heavy on string manipulation, but if you want to use one of the token-sorting scorers, like token_set_ratio or similar, you can do it that way.

This can be faster than using weights for Levenshtein, since the weighted Levenshtein distance is quite a bit slower to calculate than the uniform one. So, e.g., when comparing a string against a list of known strings, you can preprocess the known strings ahead of time, and doing that preprocessing yourself is likely faster.
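A sketch of preprocessing the known strings once ahead of time, using the stdlib difflib as a stand-in scorer (in practice you would score the canonicalized strings with one of RapidFuzz's scorers; all names here are illustrative):

```python
import difflib

ALIASES = {"st": "street", "rd": "road"}  # hypothetical alias table

def canonicalize(s):
    return " ".join(ALIASES.get(w, w) for w in s.lower().split())

choices = ["Baker St", "Main Rd", "Hill Avenue"]
# Canonicalize the known strings ONCE; reuse `prepared` for every query.
prepared = [(canonicalize(c), c) for c in choices]

def lookup(query):
    q = canonicalize(query)  # only the query is processed per call
    best = max(prepared,
               key=lambda p: difflib.SequenceMatcher(None, q, p[0]).ratio())
    return best[1]  # report the original, unmodified string

print(lookup("baker street"))  # "Baker St"
print(lookup("main rd"))       # "Main Rd"
```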

Closing this, since it is tracked as part of #241