moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/

[Splink 4] Find new matches can be simplified by creating a new linker

RobinL opened this issue

This function is very complicated and relies on a lot of hacks to make it work.

I think it could be simplified by:

  • creating a new linker in link-only mode with two datasets
  • computing the tf columns on the new records by joining to __splink__df_tf_with_concat
  • somehow ensuring the linker knows that this is a link-only job, with __splink__df_tf_with_concat as the left dataset and the new records as the right dataset
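
As a rough illustration of the second bullet, here is a minimal sketch of the join in plain SQL, run through Python's built-in sqlite3 rather than through Splink itself. The helper name `add_tf_columns`, the `new_records` table, and the `unique_id`, `first_name` and `tf_first_name` columns are assumptions made up for the example; only the __splink__df_tf_with_concat table name comes from the issue.

```python
import sqlite3


def add_tf_columns(con, new_records_table, tf_table, column):
    """Attach a term-frequency column to new records (hypothetical helper).

    Assumes tf_table holds both `column` and a matching `tf_<column>` column,
    and that new_records_table has a `unique_id` column.
    """
    tf_col = f"tf_{column}"
    sql = f"""
        SELECT n.*, lookup.{tf_col}
        FROM {new_records_table} AS n
        LEFT JOIN (
            -- one row per distinct value, so the join cannot duplicate records
            SELECT DISTINCT {column}, {tf_col} FROM {tf_table}
        ) AS lookup
        ON n.{column} = lookup.{column}
        ORDER BY n.unique_id
    """
    return con.execute(sql).fetchall()


def demo():
    con = sqlite3.connect(":memory:")
    # Stand-in for the existing data, already carrying a tf column
    con.execute(
        "CREATE TABLE __splink__df_tf_with_concat "
        "(unique_id INTEGER, first_name TEXT, tf_first_name REAL)"
    )
    con.executemany(
        "INSERT INTO __splink__df_tf_with_concat VALUES (?, ?, ?)",
        [(1, "john", 0.5), (2, "john", 0.5), (3, "mary", 0.25)],
    )
    # Stand-in for the new records, which have no tf column yet
    con.execute("CREATE TABLE new_records (unique_id INTEGER, first_name TEXT)")
    con.executemany(
        "INSERT INTO new_records VALUES (?, ?)",
        [(101, "john"), (102, "zelda")],
    )
    return add_tf_columns(
        con, "new_records", "__splink__df_tf_with_concat", "first_name"
    )


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Note that in this sketch a value that never appears in the existing data (here "zelda") comes back with a NULL tf, so a real implementation would need to decide how to treat unseen values.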

We can also get rid of this since the tf columns on the new records can be obtained by joining to __splink__df_tf_with_concat