ScalefreeCOM / datavault4dbt

Scalefree's dbt package for a Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt including the Staging Area, DV2.0 main entities, PITs and Snapshot Tables.

Home Page: https://www.scalefree.com/


Incremental loads for hubs and links don't use high-water-mark

lwehle opened this issue

commented

Hi,

I noticed that our hubs and links do not use a high-water mark (HWM) for their incremental loading.
In this issue I will only go into hubs, since links probably share the same cause.

dbt version: 1.4.6
datavault4dbt version: 1.0.17
database: snowflake

We defined our own aliases in dbt_project.yml:

#Column Aliases
  datavault4dbt.ldts_alias: 'CDP_LOADDTS'
  datavault4dbt.rsrc_alias: 'CDP_RECORD_SOURCE'

Every stage defines these two attributes:

ldts: CDP_LOADDTS
rsrc: '[static source identifier]'

Our hub definition is very slim, as we define the matching columns (HK/BK) for the hubs in the source_models.
For multiple sources:

{{ config(materialized='incremental') }}
{%- set yaml_metadata -%}
hashkey: HASHKEY_SK
business_keys: BUSINESSKEY_BK
source_models:
    V_MODEL1: {}
    V_MODEL2: {}
    V_MODEL3: {}
{%- endset -%}

For single source hubs:

{{ config(materialized='incremental') }}
{%- set yaml_metadata -%}
hashkey: HASHKEY_SK
business_keys: BUSINESSKEY_BK
source_models: V_MODEL4
{%- endset -%}

Both definitions always process the entire history of the stage data to determine the delta. Are we missing something in the definition that is necessary to calculate the high-water-mark?

We would like hubs and links to calculate their delta the same way the following sat definition does, in order to decrease our processing time.

{{ config(materialized='incremental') }}
{%- set yaml_metadata -%}
source_model: V_MODEL4
parent_hashkey: HASHKEY_SK
src_hashdiff:
  source_column: S_SAT_HASHDIFF
  alias: HASHDIFF
src_payload:
  - COL1
  - COL2
{%- endset -%}

Kind regards,
Lars

Hi @lwehle and thanks for reaching out!

To make multi-source entities benefit from High-Water Mark loading, you have to define the parameter "rsrc_static" per source_model. Check this wiki page to get some more details. After setting this, the HWM should be applied automatically!
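As a rough illustration of what high-water-mark loading means, here is a minimal Python sketch. It is not the package's implementation (datavault4dbt generates SQL for this); the function names, the row layout, and the values SOURCE_A and SOURCE_B are hypothetical. The idea is to look up the latest load date already in the target per record source, and then take only stage rows that are newer than that mark.

```python
from datetime import datetime


def max_ldts_per_rsrc(target_rows):
    """Return the latest load timestamp already in the target, per record source."""
    hwm = {}
    for row in target_rows:
        rsrc, ldts = row["rsrc"], row["ldts"]
        if rsrc not in hwm or ldts > hwm[rsrc]:
            hwm[rsrc] = ldts
    return hwm


def rows_above_hwm(stage_rows, hwm, rsrc_static):
    """Keep only stage rows loaded after the high-water mark of their source.

    If the target holds nothing for this source yet, every stage row passes.
    """
    mark = hwm.get(rsrc_static)
    return [r for r in stage_rows if mark is None or r["ldts"] > mark]


# Hypothetical target content: two record sources with different marks.
target = [
    {"rsrc": "SOURCE_A", "ldts": datetime(2023, 6, 1)},
    {"rsrc": "SOURCE_A", "ldts": datetime(2023, 6, 2)},
    {"rsrc": "SOURCE_B", "ldts": datetime(2023, 5, 30)},
]

# Hypothetical stage rows belonging to SOURCE_A.
stage_a = [
    {"hk": "h1", "ldts": datetime(2023, 6, 1)},
    {"hk": "h2", "ldts": datetime(2023, 6, 3)},
]

hwm = max_ldts_per_rsrc(target)
new_rows = rows_above_hwm(stage_a, hwm, "SOURCE_A")
```

Without a known static record source value, the lookup key for the mark is missing, which is why the stage has to be compared in full in that case.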

For the single source entities, the only way to activate the HWM would currently also be to set the rsrc_static attribute.
You can do this by changing
source_models: V_MODEL4
to this:

source_models: 
    V_MODEL4: 
       rsrc_static: <rsrc_static_value>

I completely agree that for single source entities this should not be necessary, and I wasn't really aware of it! We will definitely add it to the Backlog for future improvements, but for now you should be good to achieve what you want with these modifications.

And you could argue that this kind of single-source model definition makes it easier to add more sources in the future!

Let me know if this works for you!

Kind regards,
Tim

commented

Hi @tkirschke,

thank you for your quick response! After adding the rsrc_static parameter, the HWM CTEs are generated.

I had a different idea of the rsrc attribute until now. I assumed that the column is defined per stage model and it is automatically used as the source identifier. In our case all rsrc identifiers are static and to store them again in hubs and links seems redundant, but I understand that the hub must know from somewhere what rsrc value the stage model has.

Thanks for the clarification!

Kind regards,
Lars

commented

Hi,

there is an interesting code block for links when rsrc_static is set.
In the src_new_x cte, the rsrc_value V_STAGE_MODEL gets "deconstructed":

        SELECT
            LINKHASHKEY_SK AS LINKHASHKEY_SK,
            HASHKEY1_SK,
            HASHKEY2_SK,
            LOADDTS,
            RECORD_SOURCE
        FROM STAGE_MODEL src    
        INNER JOIN max_ldts_per_rsrc_static_in_target max ON
        (max.rsrc_static = 'V'OR
            max.rsrc_static = '_'OR
            max.rsrc_static = 'S'OR
            max.rsrc_static = 'T'OR
            max.rsrc_static = 'A'OR
            max.rsrc_static = 'G'OR
            max.rsrc_static = 'E'OR
            max.rsrc_static = '_'OR
            max.rsrc_static = 'M'OR
            max.rsrc_static = 'O'OR
            max.rsrc_static = 'D'OR
            max.rsrc_static = 'E'OR
            max.rsrc_static = 'L')
        WHERE src.LOADDTS > max.max_ldts

    )

For hubs it's looking good.
After some poking around, I found out that the type of rsrc_statics differs between hubs (list) and links (string) and the for loop {%- for rsrc_static in rsrc_statics -%} outputs every char.
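The effect can be reproduced outside of dbt. Jinja's for loop, like Python's, iterates over a string one character at a time, so a loop written for a list silently changes meaning when it receives a string. The following Python sketch is only an illustration: build_join_condition and as_list are hypothetical helpers, not part of datavault4dbt, and the normalization shown is one possible defensive approach, not necessarily what the package does.

```python
def build_join_condition(rsrc_statics):
    # Mirrors the shape of a template loop: one comparison per iterated item.
    return " OR ".join(f"max.rsrc_static = '{v}'" for v in rsrc_statics)


def as_list(value):
    # Wrap a bare string so that iteration yields whole values, not characters.
    return [value] if isinstance(value, str) else list(value)


# A list yields one comparison against the full value.
from_list = build_join_condition(["V_STAGE_MODEL"])

# A string yields one comparison per character, as in the generated link SQL.
from_string = build_join_condition("V_STAGE_MODEL")

# Normalizing first gives the same result for both input types.
normalized = build_join_condition(as_list("V_STAGE_MODEL"))

print(from_list)
print(from_string)
```

With the string input, the condition contains thirteen single-character comparisons, one for each character of V_STAGE_MODEL.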

Changing {%- set rsrc_statics = source_models_rsrc_dict[source_model]['rsrc_static'] %} to {%- set rsrc_statics = ns.source_models_rsrc_dict[source_model] -%} in macros/tables/snowflake/link.sql did the trick, but I don't know if there are other side-effects.

Kind regards,
Lars

Hi @lwehle,

I have identified the corresponding bug in the link macro and fixed it in this branch: https://github.com/ScalefreeCOM/datavault4dbt/tree/hotfix_rsrc-static

It works for me now, let me know if this solves the issue for you as well!

Kind regards and have a nice weekend,
Tim

commented

Hi @tkirschke,

I pulled the changed files for this macro from the 1.1 branch and it solves the issue for me.
Thank you!

Kind regards
Lars