rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

This library is a gem! Why is the score so low in the example below?

rjalexa opened this issue

First of all my heartfelt thanks to the developers and maintainers of this library.

I am using it to weigh the different ways people's names are written in the news in my language.

One of the main problems is that in my culture "Name Surname" is often also written as "Surname Name". I also need to ignore case (both "Andreotti" and "ANDREOTTI" are used), so I am using the token_sort_ratio approach. This works well, and empirically I am getting good accuracy with a threshold of 0.8.

However, I am confused about why the following gets such a low score:

from rapidfuzz import fuzz
urifrag = "'Éric_Rohmer'"
perstrnorm = "Eric_Rohmer"
score = fuzz.token_sort_ratio(perstrnorm, urifrag)

which yields a score of only 0.54 (repeating).

I also tried replacing the '_' with a space to split the name and surname into independent tokens, but the result is the same.

I am using version 2.14.0 with Python 3.10 (Poetry-managed project on macOS).

Thanks for any clarification.

fuzz.token_sort_ratio sorts the words alphabetically before matching, which is what you want, since it puts surname and name into a consistent order. However, in your specific example the issue is that é has a higher Unicode code point than r, so "éric" sorts after "rohmer" while "eric" sorts before it. Your comparison is therefore equivalent to:

fuzz.ratio('rohmer éric', 'eric rohmer')

which is pretty low. One solution would be to preprocess your strings by replacing accented characters with their unaccented equivalents. RapidFuzz does not provide a preprocessing function like this, but something like the following would do the job:

import unicodedata
from rapidfuzz import fuzz, utils

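def preprocess(s):
    # Decompose accented characters into base character + combining mark
    nfkd_form = unicodedata.normalize('NFKD', s)
    # Drop the combining marks, keeping only the base characters
    s = "".join([c for c in nfkd_form if not unicodedata.combining(c)])
    # Apply RapidFuzz's default processing (lowercasing, removing non-alphanumerics)
    return utils.default_process(s)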

fuzz.token_sort_ratio(perstrnorm, urifrag, processor=preprocess)
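To see why the accent changes the token order, here is a minimal standard-library sketch. It does not call RapidFuzz; `sorted_tokens` is a hypothetical stand-in that only assumes the sorting step splits on whitespace and sorts the tokens by code point, and `strip_accents` repeats the NFKD idea from the function above.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sorted_tokens(text: str) -> str:
    """Hypothetical stand-in for the token sorting step (not RapidFuzz code):
    split on whitespace, sort by code point, rejoin with spaces."""
    return " ".join(sorted(text.split()))


accented = "\u00e9ric rohmer"  # "éric rohmer", lowercased as in the comparison above
plain = "eric rohmer"

# "é" is U+00E9 (233), which is greater than "r" (114), so "éric" sorts last.
print(ord("\u00e9"), ord("r"))
print(sorted_tokens(accented))                 # rohmer éric
print(sorted_tokens(plain))                    # eric rohmer
print(sorted_tokens(strip_accents(accented)))  # eric rohmer
```

Once the accent is stripped, both strings produce the same sorted token sequence, which is why the preprocessing step raises the score.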

Thank you so much.

I will pursue your recommendation, but I have also observed empirically that the token_set scores are already very good without any transformation.

Take care and ciao from Rome - Italy!