rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/


Is there an alias feature?

mkandulavm opened this issue · comments

Hi

Is there a way to provide alias options?
For example, "street", "st", and "road" could be aliases in some scenarios.

How can this be done? Thank you.

So far there is no way to alias characters/words. There is already a request for character-dependent weights: #241.
This would allow you to alias individual elements by setting their substitution cost to 0. It would still only work on individual symbols:

Levenshtein.distance(["street", "road"], ["st", "st"]) # result is 2
weights=...
weights["street", "st"] = 0
weights["st", "street"] = 0
weights["road", "st"] = 0
weights["st", "road"] = 0
weights["street", "road"] = 0
weights["road", "street"] = 0
Levenshtein.distance(["street", "road"], ["st", "st"], weights=weights) # result is 0

which might be enough for your use case.
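To make the idea concrete, here is a minimal pure-Python sketch of a word-level Levenshtein distance with zero-cost alias substitutions, mimicking what the proposed weights from #241 would compute. The function name and signature are made up for illustration; this is not part of the RapidFuzz API.

```python
def weighted_levenshtein(a, b, zero_cost_pairs=frozenset()):
    """Levenshtein distance over token lists; substituting any pair listed
    in zero_cost_pairs (as (token_a, token_b) tuples) costs nothing."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances for the previous row
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1] or (a[i - 1], b[j - 1]) in zero_cost_pairs:
                sub = prev[j - 1]          # exact match or aliased pair: free
            else:
                sub = prev[j - 1] + 1      # substitution
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)  # vs. deletion / insertion
        prev = cur
    return prev[n]

# Alias table equivalent to the weights sketched above
aliases = {("street", "st"), ("st", "street"),
           ("road", "st"), ("st", "road"),
           ("street", "road"), ("road", "street")}

print(weighted_levenshtein(["street", "road"], ["st", "st"]))           # 2
print(weighted_levenshtein(["street", "road"], ["st", "st"], aliases))  # 0
```

This reproduces the two results from the sketch above: distance 2 without aliases, 0 with them.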

This is exactly what I need!
But is an equivalent call exposed in C++?

Also, which scorer is best for such scenarios (since tokens can be presented without order)?

But is an equivalent call exposed in C++?

So far this feature does not exist in either of them. However, it will definitely be implemented in C++; the Python implementation will only wrap it. It will extend: https://github.com/maxbachmann/rapidfuzz-cpp/blob/d937555ad76a6f1ed853ab4b7102a7b22b6f0fcf/rapidfuzz/distance/Levenshtein.hpp#L142

Also, which scorer is best for such scenarios (since tokens can be presented without order)?

At least right now the feature is only planned for Levenshtein/OSA/DamerauLevenshtein. None of those sort the tokens before comparing them.

You can also preprocess the input strings so that the fuzzy matching runs on a preprocessed string x while the result still carries the pair (x, y), with y being the original string. In that preprocessing step, you replace every word that should score the same with one canonical word in x.

This is heavy on string manipulation, but if you want to use one of the token-sorting scorers, like token_set_ratio or similar, you can do it that way.
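A minimal sketch of that preprocessing step, assuming a hand-written alias table (the names here are illustrative, not RapidFuzz API):

```python
# Example alias table: every alias maps to one canonical word.
CANONICAL = {"st": "street", "road": "street"}

def canonicalize(s):
    """Replace each aliased word in s with its canonical form."""
    return " ".join(CANONICAL.get(word, word) for word in s.lower().split())

choices = ["Baker St", "Baker Road", "Baker Avenue"]
# Keep (canonical, original) pairs so results can report the original string.
prepared = [(canonicalize(c), c) for c in choices]

print(canonicalize("Baker St"))  # "baker street"
print(prepared[0])               # ("baker street", "Baker St")
```

If you go through rapidfuzz.process, a function like `canonicalize` can also be passed via the `processor` argument so that both the query and the choices are canonicalized before scoring.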

The more words you replace (or the more similar tricks you apply, like stripping accents/combining characters), the more likely it becomes that two or more candidates tie for the best score, which can lead to inconsistent results across repeated runs on the same dataset.

If it matters, fetch the 2 or 3 best matches (or keep fetching until the scores differ) and check whether the top scores tie. If they do, either pick the 'winner' by a consistent order or, if both are somehow valid and you can, combine the results.
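A small sketch of that tie-breaking, assuming results arrive as (choice, score) pairs (the helper name is made up for illustration):

```python
def best_with_ties(results):
    """results: iterable of (choice, score) pairs.
    Returns (winner, tied_choices) with a deterministic winner."""
    # Sort by score descending, then alphabetically, so repeated runs
    # over the same data always produce the same order.
    ranked = sorted(results, key=lambda r: (-r[1], r[0]))
    best_score = ranked[0][1]
    tied = [choice for choice, score in ranked if score == best_score]
    return tied[0], tied

results = [("Baker Road", 90.0), ("Baker St", 90.0), ("Baker Ave", 72.0)]
winner, tied = best_with_ties(results)
print(winner)  # "Baker Road" (alphabetically first among the tie)
print(tied)    # ["Baker Road", "Baker St"]
```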

This is heavy on string manipulation, but if you want to use one of the token-sorting scorers, like token_set_ratio or similar, you can do it that way.

This can be faster than using weights for Levenshtein, since the weighted Levenshtein distance is quite a bit slower to calculate than the uniform one. So, e.g., when comparing a string against a list of known strings, you can preprocess the known strings ahead of time, and doing that preprocessing yourself is likely faster.
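A sketch of preprocessing the known strings once ahead of time, using the stdlib difflib as a stand-in scorer (in practice you would score the canonicalized strings with one of RapidFuzz's scorers; all names here are illustrative):

```python
import difflib

ALIASES = {"st": "street", "rd": "road"}  # hypothetical alias table

def canonicalize(s):
    return " ".join(ALIASES.get(w, w) for w in s.lower().split())

choices = ["Baker St", "Main Rd", "Hill Avenue"]
# Canonicalize the known strings ONCE; reuse `prepared` for every query.
prepared = [(canonicalize(c), c) for c in choices]

def lookup(query):
    q = canonicalize(query)  # only the query is processed per call
    best = max(prepared,
               key=lambda p: difflib.SequenceMatcher(None, q, p[0]).ratio())
    return best[1]  # report the original, unmodified string

print(lookup("baker street"))  # "Baker St"
print(lookup("main rd"))       # "Main Rd"
```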

Closing this, since it is tracked as part of #241