skrub-data / skrub

Prepping tables for machine learning

Home Page: https://skrub-data.org/

fuzzy_join's match_score starts at 0.5 not 0.0

jeromedockes opened this issue

Describe the bug

The fuzzy join documentation gives the impression that the worst match will get a score of 0.0, but in fact it gets 0.5:

matching_score = 1 - (distance / 2)
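For concreteness, here is a hypothetical sketch (not skrub's actual code) of why the documented formula bottoms out at 0.5, assuming the nearest-neighbour distance is normalized to [0, 1], which is consistent with the reported behaviour:

```python
# Hypothetical sketch, assuming the nearest-neighbour distance lies in
# [0, 1] (consistent with the reported worst score of 0.5).
def matching_score_current(distance):
    return 1 - distance / 2   # range [0.5, 1.0]

def matching_score_proposed(distance):
    return 1 - distance       # range [0.0, 1.0]

worst = 1.0  # maximal distance, i.e. no similarity at all
print(matching_score_current(worst))   # 0.5, not 0.0 as the docs imply
print(matching_score_proposed(worst))  # 0.0
```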

Steps/Code to Reproduce

any call to fuzzy_join

Expected Results

the worst score is 0

Actual Results

the worst score is 0.5

Versions

main branch

I also wonder whether specifying the threshold relative to the worst of the nearest neighbors is the most intuitive choice, compared to the difference between the nearest and second-nearest neighbor, or some measure of the average distance between random rows.
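To illustrate the second-nearest idea, here is a hypothetical sketch (`gap_score` is my own name, not a skrub function): a match is considered more trustworthy when the best candidate is clearly closer than the runner-up, regardless of the absolute distance.

```python
# Hypothetical sketch (not skrub API): score a match by the gap between
# the nearest and second-nearest candidate distances, rather than by the
# nearest distance alone.
def gap_score(distances):
    """distances: distances from one left row to all right-table rows."""
    d = sorted(distances)
    return d[1] - d[0]  # large gap => unambiguous best match

# An ambiguous row (two near-equal candidates) vs a clear-cut one.
ambiguous = gap_score([0.10, 0.11, 0.90])  # tiny gap, ~0.01
clear = gap_score([0.10, 0.80, 0.90])      # large gap, 0.70
assert clear > ambiguous
```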

Some discussion at #470

Seems like matching_score should be calculated as 1 - distance rather than 1 - distance / 2.
We need to be careful: if people are using fuzzy_join with a threshold, this change will be breaking for them.
But it needs to be changed to be consistent with the function documentation.
Also, there is an example using threshold < 0.5: https://github.com/skrub-data/skrub/blob/main/examples/04_fuzzy_joining.py#L183
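If the formula does change from 1 - distance / 2 to 1 - distance, an existing threshold t would translate to 2*t - 1 under the new scheme (my own arithmetic, not part of skrub). Notably, any old threshold below 0.5 converts to a negative value, i.e. it never excluded any match, which is why the example above is suspect.

```python
# My own arithmetic, not skrub API: translating thresholds between the
# current score (1 - d/2) and the proposed score (1 - d).
def old_score(distance):
    return 1 - distance / 2

def new_score(distance):
    return 1 - distance

def convert_threshold(old_threshold):
    # old = 1 - d/2  =>  d = 2 * (1 - old)  =>  new = 1 - d = 2*old - 1
    return 2 * old_threshold - 1

d = 0.7
assert abs(new_score(d) - convert_threshold(old_score(d))) < 1e-12
# An old threshold of 0.4 (< 0.5) converts to -0.2: it filtered nothing.
```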

@Tialo you are right about the example; see #760.

It is also true that fixing the threshold computation would be a breaking change, but since the package has not been released yet, we can make this kind of change.