moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/

[FEAT] Duplicated code in `linker.X_from_labels_Y()`

samnlindsay opened this issue

Is your proposal related to a problem?

Several methods in linker.py duplicate a lot of code because each exists as two separate functions, X_from_labels_table and X_from_labels_column, where X is one of:

  • prediction_errors
  • truth_space_table
  • roc_chart
  • precision_recall_chart
  • accuracy_chart
  • confusion_matrix (DELETED)
  • threshold_selection_tool

Together, these functions contribute almost 1000 lines to linker.py.

Describe the solution you'd like

Adding an argument to distinguish between labels held in the source data and labels held in a separate table would allow for simpler function names and almost halve the lines of code by removing duplication. The chart functions differ mostly in whether they call truth_space_table_from_labels_table or truth_space_table_from_labels_column to perform the same task.

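For example, linker.roc_chart_from_labels_table("labels") becomes something like linker.roc_chart("labels", labels_from="table"). (The argument cannot literally be called from, since that is a reserved keyword in Python.)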
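To make the proposal concrete, here is a minimal sketch of the dispatch pattern. Everything in it is hypothetical: LinkerSketch is a stand-in rather than the real Splink linker, the private helpers return placeholder strings instead of real results, and the argument is named labels_from only because from is a reserved word in Python.

```python
from typing import Literal

# Hypothetical type alias: the two places labels can live.
LabelsSource = Literal["table", "column"]


class LinkerSketch:
    """Hypothetical stand-in for the linker class; not Splink's real API."""

    def _truth_space_from_table(self, labels_table_name: str) -> str:
        # Placeholder for the existing table-based implementation.
        return f"truth space from table '{labels_table_name}'"

    def _truth_space_from_column(self, labels_column_name: str) -> str:
        # Placeholder for the existing column-based implementation.
        return f"truth space from column '{labels_column_name}'"

    def truth_space_table(self, labels: str, labels_from: LabelsSource = "table") -> str:
        # Single public entry point: the only branch on the label source lives here.
        if labels_from == "table":
            return self._truth_space_from_table(labels)
        if labels_from == "column":
            return self._truth_space_from_column(labels)
        raise ValueError(
            f"labels_from must be 'table' or 'column', got {labels_from!r}"
        )

    def roc_chart(self, labels: str, labels_from: LabelsSource = "table") -> str:
        # Chart methods no longer need a _table and a _column variant;
        # they delegate the choice to truth_space_table.
        truth_space = self.truth_space_table(labels, labels_from)
        return f"ROC chart built on {truth_space}"


linker = LinkerSketch()
print(linker.roc_chart("labels", labels_from="table"))
print(linker.roc_chart("is_match", labels_from="column"))
```

With this shape, each chart method is written once, and supporting a further label source would only touch truth_space_table.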

Additional context

You could argue that many of these methods are no longer required once #2003 is merged.