fuzzy_join's match_score starts at 0.5 not 0.0
jeromedockes opened this issue
Describe the bug
The fuzzy join documentation gives the impression the worst match will get a score of 0.0, but it will get 0.5:
Line 191 in 404275c
Steps/Code to Reproduce
any call to fuzzy_join
Expected Results
the worst score is 0
Actual Results
the worst score is 0.5
Versions
main branch
I also wonder if specifying the threshold relative to the worst of the nearest neighbors is the most intuitive, compared to the difference between nearest and second nearest or some measure of the average distance between random rows
Some discussion at #470
Seems like matching_score should be calculated as 1 - distance and not 1 - distance / 2.
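To make the difference between the two formulas concrete, here is a minimal sketch. It assumes the distance has already been rescaled to the range [0, 1], which this thread does not confirm; the function names `score_current` and `score_proposed` are hypothetical and are not part of skrub.

```python
# Assumption: `distance` is normalized so that 0.0 is a perfect match
# and 1.0 is the worst of the nearest-neighbor matches.

def score_current(distance: float) -> float:
    """Formula reported in this issue: scores fall in [0.5, 1]."""
    return 1 - distance / 2


def score_proposed(distance: float) -> float:
    """Formula suggested in this issue: scores fall in [0, 1]."""
    return 1 - distance


distances = [0.0, 0.25, 0.5, 1.0]
current = [score_current(d) for d in distances]
proposed = [score_proposed(d) for d in distances]

for d, c, p in zip(distances, current, proposed):
    print(f"distance={d:.2f}  current={c:.3f}  proposed={p:.3f}")
```

Under that assumption, the worst match (distance 1.0) scores 0.5 with the current formula and 0.0 with the proposed one, which is the mismatch with the documentation described above.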
We need to be careful: if people are using fuzzy_join with a threshold, this change will be breaking for them.
But it needs to be changed to be consistent with the function documentation.
Also, there is an example using a threshold < 0.5: https://github.com/skrub-data/skrub/blob/main/examples/04_fuzzy_joining.py#L183
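As a rough illustration of why the change would be breaking, the sketch below maps a threshold from the current scale to the proposed one. It assumes that scores are compared as `score >= threshold` and that the only change is from 1 - distance / 2 to 1 - distance, so that new_score = 2 * old_score - 1. The helper `convert_threshold` is hypothetical and does not exist in skrub.

```python
def convert_threshold(old_threshold: float) -> float:
    """Map a threshold on the 1 - distance / 2 scale to the 1 - distance scale.

    Assumption: matches are kept when score >= threshold. Old thresholds
    below 0.5 never rejected anything, so they map to 0.0.
    """
    return max(0.0, 2 * old_threshold - 1)


# An old threshold of 0.75 corresponds to 0.5 on the proposed scale.
print(convert_threshold(0.75))

# An old threshold below 0.5 filtered nothing, so it maps to 0.0.
print(convert_threshold(0.4))
```

If these assumptions hold, a threshold below 0.5 has no effect with the current formula but would start rejecting matches once the score spans [0, 1].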