moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/

[FEAT] Internally estimate probabilities for blocking-rule-related comparisons to improve EM

samkodes opened this issue

Is your proposal related to a problem?

Currently, the EMTrainingSession class has the default behaviour of disabling all comparisons used in blocking.
I believe that this is often unnecessary and discards information that may improve EM fitting. Specifically, this is the case whenever the blocking rule does not entail a specific match level.

For example, if we have a birthdate field and create one training block on year of birth, within this block it is still informative to distinguish between an exact birthdate match and an inexact match. While we would not want to save the estimated m-probabilities for the birthdate field to our model (because they are conditional on the blocking rule), estimating birthdate m-probabilities in this training session may still help improve the EM estimates for other parameters by affecting the overall match probability estimates. (Note that even if the assumption of field independence conditional on match status holds, there typically still is unconditional dependence, which is the problem here.)

Similarly, we may block on a name's initial or part of a postal code, and find value in distinguishing exact name or exact postal code matches during EM.

While the current implementation allows the user to specify comparisons to turn off (via the "comparisons_to_deactivate" parameter), passing an empty list is interpreted as disabling all comparisons on variables used in the blocking rule (because 'not []' evaluates to True in the relevant section of EMTrainingSession.__init__()). Moreover, even if we could include all comparisons this way, we would not want estimates for comparisons related to the blocking rule to be saved back to the model.
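To make the truthiness issue concrete, here is a minimal sketch. The function name and its arguments are hypothetical stand-ins written for this issue, not Splink's actual code; only the 'not []' behaviour is taken from the description above.

```python
def deactivate_with_truthiness_check(user_value, blocking_comparisons):
    """Hypothetical stand-in for the check described above (not Splink code)."""
    # `not None` and `not []` are both True, so an explicit empty list
    # cannot be told apart from "nothing supplied".
    if not user_value:
        return list(blocking_comparisons)
    return list(user_value)


blocking = ["birthdate"]
from_none = deactivate_with_truthiness_check(None, blocking)
from_empty_list = deactivate_with_truthiness_check([], blocking)
print(from_none, from_empty_list)
```

Both calls return the blocking-related comparisons, so there is no way to ask for "deactivate nothing".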

Describe the solution you'd like

The parameter "comparisons_to_deactivate" should distinguish between the default value of None and a user-supplied value of an empty list ([]). The latter should mean "do not deactivate any comparisons", whereas "None" should mean "deactivate all comparisons related to blocking rules."
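A minimal sketch of the proposed distinction, assuming the check is done with an identity test against None. The function name and the `blocking_comparisons` argument are hypothetical and only illustrate the intended semantics.

```python
from typing import List, Optional, Sequence


def resolve_comparisons_to_deactivate(
    user_value: Optional[Sequence[str]],
    blocking_comparisons: Sequence[str],
) -> List[str]:
    """Hypothetical helper illustrating the proposed None vs [] semantics."""
    if user_value is None:
        # Default: deactivate every comparison related to the blocking rule.
        return list(blocking_comparisons)
    # Any explicit list is respected as given, including the empty list.
    return list(user_value)


blocking = ["birthdate"]
default_result = resolve_comparisons_to_deactivate(None, blocking)
keep_all_result = resolve_comparisons_to_deactivate([], blocking)
print(default_result, keep_all_result)
```

With this check, None keeps today's default and [] means "do not deactivate any comparisons".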

Separate logic will be required to avoid subsequently merging estimates for columns used in the blocking rules into the main linker's model. Unless there is a need to specify these manually, enforcing the default behaviour of not merging any comparison with a column used in the blocking rules makes sense.

Internally it will be necessary in the code to distinguish between comparisons suppressed for this training session and comparisons whose estimates we do not want to save to the global model.
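One possible way to keep the two notions apart is sketched below. Everything here is an assumption for illustration: the `SessionPlan` class, the `merge_into_model` helper, the comparison names, and the made-up placeholder numbers are not part of Splink.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class SessionPlan:
    """Hypothetical bookkeeping for one EM training session."""

    all_comparisons: FrozenSet[str]
    deactivated: FrozenSet[str]        # skipped entirely in this session
    blocking_related: FrozenSet[str]   # use a column from the blocking rule

    @property
    def estimated(self) -> FrozenSet[str]:
        # Comparisons that take part in EM during this session.
        return self.all_comparisons - self.deactivated

    @property
    def saved_to_model(self) -> FrozenSet[str]:
        # Estimated comparisons whose parameters may be merged back.
        return self.estimated - self.blocking_related


def merge_into_model(
    model: Dict[str, float], session: Dict[str, float], plan: SessionPlan
) -> Dict[str, float]:
    """Copy session estimates into the model, only for saveable comparisons."""
    merged = dict(model)
    for name in plan.saved_to_model:
        if name in session:
            merged[name] = session[name]
    return merged


plan = SessionPlan(
    all_comparisons=frozenset({"birthdate", "first_name", "postcode"}),
    deactivated=frozenset(),                    # the user passed []
    blocking_related=frozenset({"birthdate"}),  # blocked on year of birth
)
model = {"birthdate": 0.90, "first_name": 0.80, "postcode": 0.70}    # placeholders
session = {"birthdate": 0.99, "first_name": 0.85, "postcode": 0.75}  # placeholders
merged = merge_into_model(model, session, plan)
print(merged)
```

In this sketch birthdate is estimated during the session but its value in the model is left untouched, while the other two comparisons are updated.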

It will also be necessary to force u-estimation on for comparisons related to the blocking rule since u-probabilities will in general be affected by the blocking rule.
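A small worked example of why the u-probability changes under blocking. It assumes, purely for illustration, that birthdates of non-matching records are independent and uniform over 50 years of 365 days each (leap years ignored); these numbers are not taken from any real dataset.

```python
from fractions import Fraction

DAYS_PER_YEAR = 365  # assumption: leap years ignored
N_YEARS = 50         # assumption: birth years span 50 years, uniformly

# Two independent non-matching records drawn uniformly over all dates.
p_same_date = Fraction(1, DAYS_PER_YEAR * N_YEARS)
p_same_year = Fraction(1, N_YEARS)

# u-probability of an exact birthdate match without blocking.
u_unblocked = p_same_date

# Within a block on year of birth we condition on "same year":
# P(same date | same year) = P(same date) / P(same year).
u_within_block = p_same_date / p_same_year

print(u_unblocked, u_within_block, u_within_block / u_unblocked)
```

Under these assumptions the u-probability for an exact birthdate match is 1/18250 without blocking and 1/365 within a year-of-birth block, a factor of 50, so a u-value estimated without the blocking rule would not be appropriate inside the blocked training session.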

Preserving backwards-compatibility for the behaviour of "comparisons_to_deactivate" will require a bit of thought, if that is a priority.

Describe alternatives you've considered

Additional context

(This problem came up while exploring a test implementation of #2030; variables I wanted to use to label cases using a semi-supervised approach were stripped out because they were used for blocking.)