dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

Home Page: https://dlthub.com/docs


Derive surrogate key from record

RalfKow opened this issue

Feature description

Usually, I create a surrogate key from the unique/primary key. At the moment the column _dlt_id is generated by

def uniq_id_base64(len_: int = 16) -> str:

which produces a random value, not a surrogate key.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

I would only have to define the primary keys in one place.

Proposed solution

Define the column "_dlt_id" with a function like the one below, using the unique keys defined in the source, or all columns if none are defined:

import hashlib

# my_list holds the values of the record's key columns
result = hashlib.md5("|".join(filter(None, my_list)).encode("utf-8")).hexdigest()

I could then use the "_dlt_id" column for incremental load.
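
A minimal sketch of that idea (the name surrogate_key and the key_columns parameter are illustrative, not part of dlt):

import hashlib

def surrogate_key(record, key_columns=None):
    # use the declared key columns if any, otherwise fall back to all columns
    columns = key_columns or sorted(record.keys())
    # join the non-null values deterministically and hash them
    payload = "|".join(str(record[c]) for c in columns if record.get(c) is not None)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()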

Related issues

No response

Hey @RalfKow maybe one of these things will help you:

  1. you can use compound primary keys for incremental load; primary_key accepts a list of columns
  2. if you want a compound cursor field, you'll need to provide a JSON path that selects multiple fields and a custom last_value_func
  3. you can bring your own _dlt_id: just create a field with that name in your data and dlt will use it. You can do this before you yield the data item from your resource, or with the add_map function (look for the insert_at argument to insert this transform before the incremental transform, as you want to use _dlt_id in it). See the sketches after this list.
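
To make point 1 concrete, a hedged sketch; the resource name orders, its columns, and the fetch_orders helper are illustrative, only the dlt calls themselves are real API:

import dlt

@dlt.resource(
    primary_key=["customer_id", "order_id"],  # compound primary key (point 1)
    write_disposition="merge",
)
def orders(updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z")):
    # fetch_orders is a placeholder for your own extraction logic
    yield from fetch_orders(since=updated_at.last_value)

And for point 3, a sketch of bringing your own _dlt_id via add_map; the key columns chosen here are an assumption:

import hashlib

def set_dlt_id(item):
    # derive _dlt_id deterministically from the key columns instead of the random default
    item["_dlt_id"] = hashlib.md5(f"{item['customer_id']}|{item['order_id']}".encode("utf-8")).hexdigest()
    return item

# insert_at=1 places this transform right after extraction, before the
# incremental step, so incremental can already see the deterministic _dlt_id
resource = orders().add_map(set_dlt_id, insert_at=1)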

Thanks for the suggested solutions.
I work in a bigger company where there are many teams.
I would like to have two metacolumns for SCD:
_hash (a hash over the primary key)
_hash_diff (a hash over the whole record)
which should be created the same way by every team.

I see multiple options to do this:

  1. I write and maintain an additional package.
  2. I ask you about a feature.
  3. I implement it in a dbt template and hope that people are going to do it the right way.

I need the _hash to check whether I should append or update, and the _hash_diff to check whether I should update at all. The current _dlt_id is random, so the same pk gets a different value on every load:

pk     _dlt_id
----   --------------
Test   ynaJFzcFDeA3dw
Test   NpZBMus9sXYMlw
Test   3p6RiWyCbGwfHQ
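
A minimal sketch of that decision logic, assuming dict-shaped records and an illustrative existing_by_hash lookup of the stored rows keyed by _hash:

def decide_action(incoming, existing_by_hash):
    # _hash identifies the business key; _hash_diff detects content changes
    current = existing_by_hash.get(incoming["_hash"])
    if current is None:
        return "append"  # unseen key: insert a new row
    if current["_hash_diff"] != incoming["_hash_diff"]:
        return "update"  # same key, changed payload: update the row
    return "skip"  # identical record: nothing to do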

@RalfKow OK, I think now I get it: you want to implement your own SCD mechanism, right? The one that we have (scd2) will create _dlt_id over the whole record (your _hash_diff). Anyway, what we can do is add a map transform to dlt.sources.helpers that adds columns with hashes. You would use it like:

table = sql_table(name).add_map(add_row_hash("_hash_diff")).add_map(add_key_hash("_hash"))  # create resource
pipeline.run(table)
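
add_row_hash and add_key_hash don't exist in dlt yet; a minimal sketch of what such helpers might look like, assuming dict-shaped items and, for the key hash, a configurable list of key columns (defaulting to pk as in the table above):

import hashlib
import json

def add_row_hash(column_name):
    # map transform factory: hash the whole record (the _hash_diff above)
    def _map(item):
        payload = json.dumps(item, sort_keys=True, default=str)
        item[column_name] = hashlib.md5(payload.encode("utf-8")).hexdigest()
        return item
    return _map

def add_key_hash(column_name, key_columns=("pk",)):
    # map transform factory: hash only the key columns (the _hash above)
    def _map(item):
        payload = "|".join(str(item[c]) for c in key_columns)
        item[column_name] = hashlib.md5(payload.encode("utf-8")).hexdigest()
        return item
    return _map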