ScalefreeCOM / datavault4dbt

Scalefree's dbt package for a Data Vault 2.0 implementation congruent with the original Data Vault 2.0 definition by Dan Linstedt, including the Staging Area, the DV 2.0 main entities, PITs, and Snapshot Tables.

Home Page: https://www.scalefree.com/


Effectivity handled with duplicate unique keys

cjpotgieter opened this issue

Hi,
I have a question about how the logic handles your proposed approach for effectivity dates that come from the source.
You specify that these attributes should simply be handled like normal descriptive attributes.
The Satellite takes a parameter for the unique key, which is typically only the hash key and the load date. Is this still the case?
Does dbt create a primary key on the table for this, or is this just used for incremental load checks? Should one include the effectivity dates in here as well?

We ran a small test on the sql branch (which we pulled while it was still available), and it actually seemed to work: because there was no primary key on the actual tables, the duplicate "unique keys" didn't cause an error. However, we didn't test subsequent loads.

Hi @cjpotgieter and thanks for reaching out!

Does dbt create a primary key on the table for this, or is this just used for incremental load checks?

dbt does not create a primary key based on this! The information specified in the "unique_key" config only influences the loading mechanism of the "incremental" materialisation. It is also not required when using our package, since we handle all incremental logic ourselves.
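
For illustration, a minimal sketch of what unique_key does in a plain dbt incremental model (not a datavault4dbt macro); the model and column names here are assumptions, and a list-valued unique_key needs an adapter that supports it:

{{ config(
    materialized='incremental',
    unique_key=['hk_customer_h', 'ldts']
) }}

-- unique_key only tells dbt's incremental materialisation which existing
-- rows to match and overwrite; it never creates a database-level primary
-- key constraint.
select * from {{ ref('stg_customer') }}

{% if is_incremental() %}
-- only pick up records newer than the latest load already in the target
where ldts > (select max(ldts) from {{ this }})
{% endif %}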

Should one include the effectivity dates in here as well?

Since it does not affect any Primary Key constraint, you don't need to include them there.
Could you describe your data a bit more? Are you getting multiple effectivity ranges for one HK with the same load_date?

Best regards,
Tim

The data:
Yes, exactly: with each load there is a new batch of data with from and to dates. So the hash key (from the business key) and the load date will be repeated multiple times, but each of the (let's say 20) records will have a different from and to date.

I have a master data source where historical data gets updated with new 'valid from' and 'valid to' periods. In each batch, I receive the full history, including valid from/to periods, per business key. If I understand correctly, your case sounds similar. You could use a multi-active satellite with an ma key, such as ROW_NUMBER() OVER (PARTITION BY hk, ldts ORDER BY valid_from, ISNULL(valid_to, '8888-12-31')), and include all the columns as payload.
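
As a rough sketch, that ma key on a staging query could look like this (table and column names are assumptions; ISNULL is SQL Server syntax, use COALESCE elsewhere):

SELECT
    hk,
    ldts,
    valid_from,
    valid_to,
    -- number the effectivity ranges per hash key and load date, so that
    -- the combination (hk, ldts, ma_attribute) becomes unique
    ROW_NUMBER() OVER (
        PARTITION BY hk, ldts
        ORDER BY valid_from, ISNULL(valid_to, '8888-12-31')
    ) AS ma_attribute
FROM stg_customer;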

It does sound similar, yes. Can I check against your MA satellite wiki example:
Should src_ma_key be set to the ROW_NUMBER expression you mentioned, and NOT to a value from the source?

{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
source_model: 'stg_customer'
parent_hashkey: 'hk_customer_h'
src_hashdiff: 'hd_customer_s'
src_ma_key: 'ma_attribute'
src_payload:
    - phonenumber
    - address
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.ma_sat_v0(source_model=metadata_dict['source_model'],
parent_hashkey=metadata_dict['parent_hashkey'],
src_hashdiff=metadata_dict['src_hashdiff'],
src_ma_key=metadata_dict['src_ma_key'],
src_payload=metadata_dict['src_payload']) }}

@cmettler @tkirschke Any answer on this?

Hi @cjpotgieter,
yes, you also need to change the stage to multi-active by adding this:

derived_columns:
    ma_attribute:
        value: 'row_number() over (partition by customer_key, ldts order by validfrom, validto)'
        datatype: int

include_source_columns: true
multi_active_config:
    multi_active_key:
        - ma_attribute
    main_hashkey_column: 'hk_customer_h'
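
For context, a sketch of how this config could sit in a complete staging model, following the same yaml_metadata pattern as the satellite above (the source model name, the ldts/rsrc settings, and the hashed column definitions are assumptions):

{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: 'source_customer'
ldts: 'edwLoadDate'
rsrc: 'edwRecordSource'
include_source_columns: true
derived_columns:
    ma_attribute:
        value: 'row_number() over (partition by customer_key, ldts order by validfrom, validto)'
        datatype: int
hashed_columns:
    hk_customer_h:
        - customer_key
    hd_customer_s:
        is_hashdiff: true
        columns:
            - phonenumber
            - address
multi_active_config:
    multi_active_key:
        - ma_attribute
    main_hashkey_column: 'hk_customer_h'
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.stage(source_model=metadata_dict['source_model'],
                       ldts=metadata_dict['ldts'],
                       rsrc=metadata_dict['rsrc'],
                       include_source_columns=metadata_dict['include_source_columns'],
                       derived_columns=metadata_dict['derived_columns'],
                       hashed_columns=metadata_dict['hashed_columns'],
                       multi_active_config=metadata_dict['multi_active_config']) }}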

Please note that this is my personal view and not that of Scalefree.

Ok, thanks @cmettler.
What is the Scalefree opinion?

Hi @cjpotgieter, thanks for the question. In your case I believe I would also use a multi-active satellite, but instead of a window function I would rather use your existing valid_from and valid_to columns as the multi-active attributes. So for each business key coming from the source, you would have a different set of valid_from and valid_to dates per load_datetime. Try it and see if it fits your use case :)
Also remember that the hashdiff is calculated per group.
Hope this helps!
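
That alternative would only change the stage config, roughly like this (a sketch, reusing the names from the examples above):

multi_active_config:
    # the effectivity dates themselves act as the multi-active key,
    # so no derived row_number column is needed
    multi_active_key:
        - valid_from
        - valid_to
    main_hashkey_column: 'hk_customer_h'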

Thanks @bschlottfeldt
I have also heard that, for this type of load, Scalefree prefers to add nanoseconds to the load date rather than create a multi-active satellite.

  1. Do you agree with this approach?
  2. Is there a way to do the nanosecond approach with your dbt package?