skrub-data / skrub

Prepping tables for machine learning

Home Page: https://skrub-data.org/

fuzzy_join's match_score starts at 0.5 not 0.0

jeromedockes opened this issue

Describe the bug

The fuzzy join documentation gives the impression that the worst match will get a score of 0.0, but in fact it gets 0.5:

matching_score = 1 - (distance / 2)
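For concreteness, here is a hypothetical sketch (not skrub's actual code) of why the documented formula bottoms out at 0.5, assuming the nearest-neighbour distance is normalized to [0, 1], which is consistent with the reported behaviour:

```python
# Hypothetical sketch, assuming the nearest-neighbour distance lies in
# [0, 1] (consistent with the reported worst score of 0.5).
def matching_score_current(distance):
    return 1 - distance / 2   # range [0.5, 1.0]

def matching_score_proposed(distance):
    return 1 - distance       # range [0.0, 1.0]

worst = 1.0  # maximal distance, i.e. no similarity at all
print(matching_score_current(worst))   # 0.5, not 0.0 as the docs imply
print(matching_score_proposed(worst))  # 0.0
```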

Steps/Code to Reproduce

any call to fuzzy_join

Expected Results

the worst score is 0

Actual Results

the worst score is 0.5

Versions

main branch

I also wonder whether specifying the threshold relative to the worst of the nearest neighbors is the most intuitive choice, compared to the difference between the nearest and second-nearest neighbor, or some measure of the average distance between random rows.
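To illustrate the second-nearest idea, here is a hypothetical sketch (`gap_score` is my own name, not a skrub function): a match is considered more trustworthy when the best candidate is clearly closer than the runner-up, regardless of the absolute distance.

```python
# Hypothetical sketch (not skrub API): score a match by the gap between
# the nearest and second-nearest candidate distances, rather than by the
# nearest distance alone.
def gap_score(distances):
    """distances: distances from one left row to all right-table rows."""
    d = sorted(distances)
    return d[1] - d[0]  # large gap => unambiguous best match

# An ambiguous row (two near-equal candidates) vs a clear-cut one.
ambiguous = gap_score([0.10, 0.11, 0.90])  # tiny gap, ~0.01
clear = gap_score([0.10, 0.80, 0.90])      # large gap, 0.70
assert clear > ambiguous
```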

Some discussion at #470

Seems like matching_score should be calculated as 1 - distance rather than 1 - distance / 2.
We need to be careful: if people are using fuzzy_join with a threshold, this change will be breaking for them.
But it needs to be changed to be consistent with the function documentation.
Also, there is an example using threshold < 0.5: https://github.com/skrub-data/skrub/blob/main/examples/04_fuzzy_joining.py#L183
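If the formula does change from 1 - distance / 2 to 1 - distance, an existing threshold t would translate to 2*t - 1 under the new scheme (my own arithmetic, not part of skrub). Notably, any old threshold below 0.5 converts to a negative value, i.e. it never excluded any match, which is why the example above is suspect.

```python
# My own arithmetic, not skrub API: translating thresholds between the
# current score (1 - d/2) and the proposed score (1 - d).
def old_score(distance):
    return 1 - distance / 2

def new_score(distance):
    return 1 - distance

def convert_threshold(old_threshold):
    # old = 1 - d/2  =>  d = 2 * (1 - old)  =>  new = 1 - d = 2*old - 1
    return 2 * old_threshold - 1

d = 0.7
assert abs(new_score(d) - convert_threshold(old_score(d))) < 1e-12
# An old threshold of 0.4 (< 0.5) converts to -0.2: it filtered nothing.
```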

@Tialo you are right about the example; see #760.

It is also true that fixing the threshold computation would be a breaking change, but since the package has not been released yet, we can make this kind of change.