MaartenGr / PolyFuzz

Hi,
I am observing that TF-IDF gives an exact-match score for terms that are not exact matches.

For example:

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

test_tolist = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]
test_fromlist = ["i testtext"]

test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf_test")

PolyFuzz(test_model).match(test_fromlist, test_tolist).get_matches()

Output:

	From	To	Similarity	To_2	Similarity_2	To_3	Similarity_3	To_4	Similarity_4	To_5	Similarity_5
0	i testtext	i q testtext	1	j testtext	1	x testtext	1	testtext	1	k testtext	1

Explanation:
Here "i testtext" is matched to "x testtext" and the others with a similarity of 1, even though the strings differ.
I also tested the same lists with RapidFuzz using the fuzz.ratio scorer, and it gives the expected result.
I assume the TF-IDF scorer behaves like partial_token_ratio, since RapidFuzz gives the same result with that scorer.

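That is correct. This implementation of the TF-IDF similarity measure removes n-grams that contain whitespace in order to prevent RAM issues when analyzing large datasets. In your example, every n-gram that touches "i", "x", "q" and so on also contains a space, so all of the strings reduce to the n-grams of "testtext" and end up with identical vectors: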

PolyFuzz/polyfuzz/models/_tfidf.py

Line 130 in b26638f

ngrams = [''.join(ngram) for ngram in ngrams if ' ' not in ngram]
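To see the effect of that line in isolation, here is a minimal sketch that applies the same whitespace filter to the strings from the question. The function name create_ngrams and its signature are assumptions for illustration, not PolyFuzz's actual API; only the list comprehension is taken from the line quoted above.

```python
def create_ngrams(string, n_gram_range=(2, 5)):
    """Character n-grams, dropping any n-gram that contains a space.

    Hypothetical helper: only the filtering expression mirrors the
    quoted PolyFuzz line; the rest is an illustrative assumption.
    """
    result = []
    for n in range(n_gram_range[0], n_gram_range[1] + 1):
        # Slide a window of width n over the string, one character at a time.
        ngrams = zip(*[string[i:] for i in range(n)])
        result.extend("".join(ngram) for ngram in ngrams if " " not in ngram)
    return result


to_list = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]
reference = create_ngrams("i testtext")

for candidate in to_list:
    # Every candidate keeps only the n-grams that lie inside "testtext".
    print(candidate, create_ngrams(candidate) == reference)
```

Because single-character tokens such as "i" or "x" can only appear in n-grams that also include the adjacent space, they contribute nothing once the filter is applied.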

If you do want whitespace n-grams for your dataset, you can remove that line yourself and create a TF-IDF vectorizer with your own custom settings, as described in the documentation.
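As a rough illustration of what keeping whitespace n-grams changes, the sketch below scores the same strings with character n-grams that include spaces. It uses plain n-gram counts and cosine similarity from the standard library rather than TF-IDF weights, and the helpers char_ngrams and cosine are hypothetical names, so the numbers will not match what a real TF-IDF vectorizer would return.

```python
import math
from collections import Counter


def char_ngrams(string, n_gram_range=(2, 5)):
    """Character n-grams, keeping those that contain whitespace."""
    low, high = n_gram_range
    return [
        string[i:i + n]
        for n in range(low, high + 1)
        for i in range(len(string) - n + 1)
    ]


def cosine(a, b):
    """Cosine similarity between raw n-gram count vectors (no IDF weighting)."""
    va, vb = Counter(char_ngrams(a)), Counter(char_ngrams(b))
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


query = "i testtext"
for candidate in ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]:
    print(f"{candidate!r}: {cosine(query, candidate):.3f}")
```

With the space-containing n-grams retained, the candidates no longer share an identical n-gram set with the query, so none of them scores a perfect 1.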

TF-IDF is giving same score for different to_list