MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.

Home Page: https://maartengr.github.io/PolyFuzz/

TF-IDF is giving same score for different to_list

ashutosh486 opened this issue

Hi,
I am observing that TF-IDF gives an exact-match score for terms that are not exact matches.

For example:

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

test_tolist = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]
test_fromlist = ["i testtext"]

test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf_test")

PolyFuzz(test_model).match(test_fromlist, test_tolist).get_matches()

Output:

   From        To            Similarity  To_2        Similarity_2  To_3        Similarity_3  To_4      Similarity_4  To_5        Similarity_5
0  i testtext  i q testtext  1           j testtext  1             x testtext  1             testtext  1             k testtext  1

Explanation:
Here "i testtext" gets a perfect score against "x testtext" and the other entries, even though the strings differ.
I also ran the same test in RapidFuzz with fuzz.ratio as the scorer, and it gives the expected result.
I assume the scorer in TF-IDF behaves like partial_token_ratio, since RapidFuzz gives the same result with that scorer.

That is correct. This implementation of the TF-IDF similarity measure removes n-grams that contain whitespace, in order to prevent RAM issues when analyzing large datasets:

ngrams = [''.join(ngram) for ngram in ngrams if ' ' not in ngram]
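To see why every entry in the example scores 1, here is a minimal standalone sketch built around the filter line above. The function name `char_ngrams` is hypothetical and this is not PolyFuzz's actual code path; it only assumes character n-grams over the `n_gram_range=(2, 5)` used in the question. Because the single-character tokens ("i", "k", "x", "j", "q") are shorter than the minimum n-gram length, every n-gram that touches them also contains a space and is dropped, so each string is left with the n-grams of "testtext" alone.

```python
from typing import List, Tuple


def char_ngrams(text: str, n_gram_range: Tuple[int, int] = (2, 5)) -> List[str]:
    """Character n-grams of `text`, dropping any n-gram that contains a space.

    Hypothetical helper for illustration only, not part of the PolyFuzz API.
    """
    low, high = n_gram_range
    result = []
    for n in range(low, high + 1):
        # Zipping n shifted copies of the string yields every window of length n
        windows = zip(*[text[i:] for i in range(n)])
        # Same filter as the line quoted above
        result.extend("".join(w) for w in windows if " " not in w)
    return result


to_list = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]
reference = char_ngrams("i testtext")

for candidate in to_list:
    # Compare each candidate's surviving n-grams with those of the query string
    print(candidate, char_ngrams(candidate) == reference)
```

Identical n-gram lists produce identical vectors, which is why the cosine similarity comes out as 1 for all five candidates.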

If you do want n-grams that contain whitespace to be kept for your dataset, you can remove that line yourself and create a TF-IDF vectorizer with your own custom settings, as described in the documentation.
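The effect of keeping whitespace n-grams can be checked without touching the library. The sketch below is an illustration under stated assumptions, not the PolyFuzz implementation: it uses raw n-gram counts and plain cosine similarity rather than TF-IDF weighting, and the names `ngram_counts`, `cosine`, and `keep_whitespace` are hypothetical. It compares the query from the question against each candidate with the filter on and off.

```python
import math
from collections import Counter


def ngram_counts(text, n_gram_range=(2, 5), keep_whitespace=True):
    """Count character n-grams; optionally drop those containing a space.

    Hypothetical helper: raw counts only, no TF-IDF weighting.
    """
    low, high = n_gram_range
    grams = Counter()
    for n in range(low, high + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if keep_whitespace or " " not in gram:
                grams[gram] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "i testtext"
candidates = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]

for candidate in candidates:
    kept = cosine(ngram_counts(query), ngram_counts(candidate))
    dropped = cosine(
        ngram_counts(query, keep_whitespace=False),
        ngram_counts(candidate, keep_whitespace=False),
    )
    print(f"{candidate!r}: whitespace kept={kept:.3f}, whitespace dropped={dropped:.3f}")
```

With whitespace n-grams kept, the candidates no longer tie with the query, at the cost of a larger vocabulary, which is the memory trade-off mentioned above.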