Derive surrogate key from record
RalfKow opened this issue · comments
Feature description
Usually, I create a surrogate key from the unique/primary key. At the moment the column `_dlt_id` is defined by
Line 65 in 713aa31
Are you a dlt user?
Yes, I'm already a dlt user.
Use case
I would only have to define the primary keys in one place.
Proposed solution
Define the column `_dlt_id` with a function like the one below, using the unique keys defined in the source, or all columns if none are defined:
import hashlib
key_values = [str(v) for v in my_list if v is not None]
result = hashlib.md5("|".join(key_values).encode("utf-8")).hexdigest()
That way I could use `_dlt_id` for incremental load.
Related issues
No response
Hey @RalfKow maybe one of these things will help you:
- you can use compound primary keys for incremental load. `primary_key` accepts a list of columns; if you want a compound cursor field you'll need to provide a JSON path that selects multiple fields and a custom `last_value_func`
- you can bring your own `_dlt_id`: just create a field with that name in your data and dlt will use it. You can do this before you yield the data item from your resource, or with an `add_map` function (see the `insert_at` argument to insert this transform before the incremental transform, since you want to use `_dlt_id` in it)
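A minimal sketch of the "bring your own `_dlt_id`" option: a plain map function that derives the id from chosen key columns before dlt sees the record. The column name `id` and the md5 choice are illustrative assumptions, not prescribed by dlt; check the dlt docs for the exact `add_map`/`insert_at` semantics.

```python
import hashlib

def set_dlt_id(item, key_columns=("id",)):
    """Derive a deterministic _dlt_id from the record's key columns.

    Sketch of a function you could pass to resource.add_map(...);
    key_columns=("id",) is an assumed example schema.
    """
    key = "|".join(str(item[c]) for c in key_columns)
    item["_dlt_id"] = hashlib.md5(key.encode("utf-8")).hexdigest()
    return item

row = set_dlt_id({"id": 1, "value": "a"})
print(row["_dlt_id"])  # -> "c4ca4238a0b923820dcc509a6f75849b"
```

Because the hash depends only on the key columns, re-running the pipeline over the same record yields the same `_dlt_id`, which is what makes it usable for incremental load.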
Thanks for the suggested solutions.
I work in a bigger company where there are many teams.
I would like to have two metacolumns for SCD:
- `_hash` (a hash over the primary key)
- `_hash_diff` (a hash over the whole record)
which should be created the same way in every team.
I see multiple options to do this:
- I write and maintain an additional package
- I ask you about a feature
- I implement it in the dbt template and hope that people are going to do it the right way.
I need `_hash` to check whether I should append or update,
and `_hash_diff` to check whether an update is needed at all.
pk | _dlt_id |
---|---|
Test | ynaJFzcFDeA3dw |
Test | NpZBMus9sXYMlw |
Test | 3p6RiWyCbGwfHQ |
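The two metacolumns described above could be computed per record like this (a minimal sketch; md5 is chosen to match the earlier snippet, and `json.dumps(..., sort_keys=True)` is one assumed way to canonicalize the whole record):

```python
import hashlib
import json

def row_hashes(record, pk_columns):
    """Return (_hash, _hash_diff): a hash over the primary key columns
    and a hash over the whole record. Sketch of the two SCD metacolumns."""
    pk_part = "|".join(str(record[c]) for c in pk_columns)
    # sort_keys makes the whole-record hash independent of key order
    full_part = json.dumps(record, sort_keys=True, default=str)
    return (
        hashlib.md5(pk_part.encode("utf-8")).hexdigest(),
        hashlib.md5(full_part.encode("utf-8")).hexdigest(),
    )

old = {"pk": "Test", "value": 1}
new = {"pk": "Test", "value": 2}
h_old, d_old = row_hashes(old, ["pk"])
h_new, d_new = row_hashes(new, ["pk"])
# same pk -> same _hash (update, not append);
# payload changed -> different _hash_diff (the update is needed)
assert h_old == h_new and d_old != d_new
```

Unlike the `_dlt_id` values in the table above, `_hash` stays stable across runs for the same primary key, which is exactly the append-vs-update signal being asked for.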
@RalfKow OK, I think now I get it: you want to implement your own SCD mechanism, right? The one that we have (`scd2`) will create `_dlt_id` over the whole record (your `_hash_diff`). Anyway, what we can do is add a map transform to `dlt.source.helpers` that adds columns with hashes. You would use it like:
table = sql_table(name).add_map(add_row_hash("_hash_diff")).add_map(add_key_hash("_hash")) # create resource
pipeline.run(table)
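`add_row_hash` and `add_key_hash` are proposed helpers that do not exist in dlt at the time of writing; here is one hedged sketch of how they could be implemented as map-transform factories (md5 and the `pk` default are illustrative assumptions):

```python
import hashlib
import json

def add_row_hash(column_name):
    """Hypothetical helper: returns a map transform that adds a hash
    over the whole record under column_name."""
    def _transform(item):
        payload = json.dumps(item, sort_keys=True, default=str)
        item[column_name] = hashlib.md5(payload.encode("utf-8")).hexdigest()
        return item
    return _transform

def add_key_hash(column_name, key_columns=("pk",)):
    """Hypothetical helper: returns a map transform that adds a hash
    over the given key columns under column_name."""
    def _transform(item):
        key = "|".join(str(item[c]) for c in key_columns)
        item[column_name] = hashlib.md5(key.encode("utf-8")).hexdigest()
        return item
    return _transform

# Applied in the same order as the add_map chain above: the row hash is
# computed first, so _hash_diff does not include the _hash column itself.
row = add_key_hash("_hash")(add_row_hash("_hash_diff")({"pk": "Test", "value": 1}))
```

Each factory returns a plain `item -> item` function, which is the shape `resource.add_map(...)` expects, so the chained usage in the snippet above would compose naturally.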