moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/

[FEAT] Save out `SplinkDataFrame` metadata

ADBond opened this issue

As mentioned in #1971, the parquet format supports arbitrary key-value metadata. As `SplinkDataFrame`s now support such metadata (used in particular for storing table-creation thresholds), it would be nice if this could be written to and read from parquet.

Backend notes:

  • Supported in duckdb, though I think (currently, as of 0.10.0) only via a literal struct in the SQL, which would need to be carefully constructed, rather than via e.g. a subquery
  • Doesn't appear to be directly supported in spark; could possibly go via pyarrow
  • athena uses arrow under the hood, so should be okay
  • postgres/sqlite: we don't currently have a to_parquet(), but could look into implementing one
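
For the duckdb route, a minimal sketch of what "carefully constructing" the literal struct might look like, assuming the `KV_METADATA` option of duckdb's parquet `COPY ... TO`. The helper names (`sql_string_literal`, `kv_metadata_struct`, `copy_to_parquet_sql`) and the metadata key `threshold_match_probability` are hypothetical and not part of Splink; the sketch only builds the SQL string and does not execute it.

```python
import json


def sql_string_literal(text: str) -> str:
    """Quote text as a SQL string literal, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def kv_metadata_struct(metadata: dict) -> str:
    """Render a dict as a literal struct such as {'key': 'value'}.

    Non-string values are JSON-encoded so that every value is stored as text.
    """
    if not metadata:
        raise ValueError("metadata must contain at least one key")
    parts = []
    for key, value in metadata.items():
        value_text = value if isinstance(value, str) else json.dumps(value)
        parts.append(
            f"{sql_string_literal(str(key))}: {sql_string_literal(value_text)}"
        )
    return "{" + ", ".join(parts) + "}"


def copy_to_parquet_sql(table_name: str, path: str, metadata: dict) -> str:
    """Build a COPY statement that writes table_name to parquet with metadata.

    Assumption: table_name is a trusted identifier; it is not escaped here.
    """
    return (
        f"COPY (SELECT * FROM {table_name}) TO {sql_string_literal(path)} "
        f"(FORMAT PARQUET, KV_METADATA {kv_metadata_struct(metadata)})"
    )


# Hypothetical table name and metadata key, for illustration only
statement = copy_to_parquet_sql(
    "df_predict", "predictions.parquet", {"threshold_match_probability": 0.9}
)
print(statement)
```

The resulting string could then be passed to a duckdb connection's `execute()`. Reading the metadata back should be possible with duckdb's `parquet_kv_metadata` table function, though the keys and values may come back as blobs and need decoding.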