This library is a gem! Why score is so low in the example below?

rjalexa opened this issue a year ago · comments

First of all my heartfelt thanks to the developers and maintainers of this library.

I am using it to weight the different ways people's names are written in the news in my language.

One of the main problems is that in my culture "Name Surname" is often also written as "Surname Name". I also need to ignore case (both "Andreotti" and "ANDREOTTI" are used), so I am using the token_sort_ratio approach. This works well, and empirically I am getting good accuracy with a threshold of 0.8.

I am confused about what makes the following get such a low score though:

from rapidfuzz import fuzz
urifrag = "'Éric_Rohmer'"
perstrnorm = "Eric_Rohmer"
score = fuzz.token_sort_ratio(perstrnorm, urifrag)

which yields a score of only 0.54 (repeating, i.e. 0.5454...).

I also tried replacing the '_' with a blank to split name_surname into independent tokens, but the result is the same.

Using version 2.14.0 with Python 3.10 (Poetry-managed project on macOS).

Thanks for any clarification

Max Bachmann · Answer 1 · Tue Jul 04 2023 19:25:48 GMT+0800 (China Standard Time)

fuzz.token_sort_ratio sorts the words alphabetically before matching, which is what you want, since it puts surname and name into a consistent order. In your specific example, however, the issue is that É has a higher Unicode code point than R, so "Éric" sorts after "Rohmer". Your comparison is therefore equivalent to:

fuzz.ratio('rohmer éric', 'eric rohmer')

which gives a pretty low score. One solution would be to preprocess your strings by replacing accented characters with their unaccented equivalents. The library does not provide a preprocessing function like this, but something like this would do the job:

import unicodedata
from rapidfuzz import fuzz, utils

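def preprocess(s):
    # Decompose accented characters into base character + combining mark
    nfkd_form = unicodedata.normalize('NFKD', s)
    # Drop the combining marks, keeping only the base characters
    s = "".join(c for c in nfkd_form if not unicodedata.combining(c))
    # Apply the library's default processing on top
    return utils.default_process(s)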

fuzz.token_sort_ratio(perstrnorm, urifrag, processor=preprocess)
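
The sorting effect described above can be reproduced with the standard library alone. The sketch below is only an illustration, not the library's implementation: sort_tokens, strip_accents, lcs_length and indel_ratio are hypothetical helper names, and indel_ratio assumes the score is 100 * 2 * LCS / (len(a) + len(b)), which matches the roughly 54.5 observed in the question.

```python
import unicodedata


def sort_tokens(s):
    """Lowercase, split on whitespace, sort tokens by code point, rejoin."""
    return " ".join(sorted(s.lower().split()))


def strip_accents(s):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Assumed stand-in for the ratio score: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 100.0 if total == 0 else 100.0 * 2 * lcs_length(a, b) / total


# "\u00c9" is the precomposed "É"; the escape avoids editor normalization issues.
accented = sort_tokens("\u00c9ric Rohmer")               # accented token sorts last
plain = sort_tokens("Eric Rohmer")                       # name sorts first
fixed = sort_tokens(strip_accents("\u00c9ric Rohmer"))   # same order as plain

print(accented, "|", plain, "|", fixed)
print(indel_ratio(accented, plain))  # low: the two tokens end up in swapped order
print(indel_ratio(fixed, plain))     # identical strings after accent stripping
```

Because the accented token sorts after "rohmer", the two sorted strings share only "rohmer" as a common subsequence, which is why the score drops even though the names differ by a single accent.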

Robert Alexander · Answer 2 · Tue Jul 04 2023 20:03:35 GMT+0800 (China Standard Time)

Thank you so much.

I will pursue your recommendation, but I have also observed empirically that the token_set scores are already very good without any transformations.

Take care and ciao from Rome - Italy!