TF-IDF is giving same score for different to_list
ashutosh486 opened this issue
Hi,
I am observing that TF-IDF is giving an exact match for terms that are not exact matches.
For example:
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF
test_tolist = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]
test_fromlist = ["i testtext"]
test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf_test")
PolyFuzz(test_model).match(test_fromlist, test_tolist).get_matches()
Output:
| | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 | To_5 | Similarity_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | i testtext | i q testtext | 1 | j testtext | 1 | x testtext | 1 | testtext | 1 | k testtext | 1 |
Explanation:
Here "i testtext" is reported as an exact match for "x testtext" and the others, even though the strings differ.
I also ran the same test with RapidFuzz using fuzz.ratio as the scorer, and it gives the expected result.
I assume the scorer in TF-IDF is set to partial_token_ratio, since RapidFuzz with that scorer gives the same result as TF-IDF.
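That is correct. This implementation of the TF-IDF similarity measure removes n-grams that contain whitespace in order to prevent RAM issues when analyzing large datasets. With n_gram_range=(2,5), every n-gram that includes one of the single-letter words also includes a space and is dropped, so each string in your example is reduced to the n-grams of "testtext" and scores 1. The relevant line: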
PolyFuzz/polyfuzz/models/_tfidf.py
Line 130 in b26638f
If you do want whitespace n-grams for your dataset, you can remove that line yourself and create a TF-IDF vectorizer with your own custom settings, as described in the documentation.
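To make the effect concrete, here is a minimal sketch in plain Python. It is not PolyFuzz's actual code: `char_ngrams`, `cosine`, and the `keep_whitespace` flag are hypothetical names used only for illustration, and plain n-gram counts stand in for TF-IDF weights. It compares the example strings with whitespace n-grams dropped and with them kept.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_range=(2, 5), keep_whitespace=False):
    """Return the character n-grams of text for every n in the inclusive range."""
    lo, hi = n_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Assumption: mimic the described behaviour by skipping n-grams with a space.
            if keep_whitespace or " " not in gram:
                grams.append(gram)
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (plain counts, no IDF)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


query = "i testtext"
candidates = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]

for cand in candidates:
    dropped = cosine(char_ngrams(query), char_ngrams(cand))
    kept = cosine(char_ngrams(query, keep_whitespace=True),
                  char_ngrams(cand, keep_whitespace=True))
    print(f"{cand!r}: dropped={dropped:.2f} kept={kept:.2f}")
```

With whitespace n-grams dropped, every string yields exactly the n-grams of "testtext", so the vectors are identical regardless of how they are weighted. With them kept, the single-letter prefixes contribute distinguishing n-grams and the scores fall below 1.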