1bk / dbt-teach-in

A repo with examples for a more advanced dbt_ teach-in session


dbt_ Teach-in


Introduction


Introducing (and a refresher of) the more advanced dbt_ features and usage. This repo contains the examples used during the teach-in session.

Video Recording of the Teach-in Session

dbt-teach-in-video

Contents


Quick Links


Requirements


  1. A Database

    • You may either use the Analytics Redshift; or
    • Spin up a local Postgres database instance (using docker)
  2. Python

    • Version 3.6 or above
  3. Git

    • To clone this repository from GitHub
  4. dbt_

  5. For other Python modules (if any), see:

Initial Setup


  1. Prepare the Database

    • If you are using the Analytics Redshift, make sure to

      • Set the profiles.yml in the dbt project accordingly.
        • Set your user
        • Set your schema
        • Change the target to redshift
      • Export your password to the environment variable:
        export PASS='<your_password_here>'
    • If you want to run a local Postgres database instance, you will need to:

      • Install Docker
      • Docker pull and run a local Postgres instance
        docker run --rm --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 postgres
        Tip: add -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data to persist the database.
      • To stop the database when done, run:
        docker stop pg-docker
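
    For reference, a minimal profiles.yml along these lines should cover both setups (the profile name, schema, and Redshift placeholders are assumptions; match them to the `profile:` setting in dbt_project.yml):

    ```yaml
    dbt_teach_in:                        # profile name; must match dbt_project.yml
      target: postgres                   # change to `redshift` when using Redshift
      outputs:
        postgres:
          type: postgres
          host: localhost
          port: 5432
          user: postgres
          pass: docker                   # POSTGRES_PASSWORD from the docker run above
          dbname: postgres
          schema: dbt_teach_in           # assumed target schema
        redshift:
          type: redshift
          host: <redshift_host>
          port: 5439
          user: <your_user>
          pass: "{{ env_var('PASS') }}"  # reads the exported PASS variable
          dbname: <database>
          schema: <your_schema>
    ```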
  2. Installing Python

  3. Cloning this repository

    • If you haven't already, install Git (we recommend installing via Homebrew)
    • Create a directory
      mkdir ~/github
      cd ~/github
    • Clone the repository
      git clone git@github.com:1bk/dbt-teach-in.git

      If you are having trouble with this, please reach out to me.

  4. Installing dbt_

    • While there are many ways to install dbt_, we suggest installing it in a Python virtual environment.
    • Follow the official dbt_ guide that details each step:
      1. Create a local environment
        cd dbt-teach-in                     # set the working directory to the cloned repo
         
        python3 -m venv dbt-env             # create the environment
        source dbt-env/bin/activate         # activate the environment
      2. Install dbt
        pip install dbt

Topics


The topics here are shared in the order in which they were presented.

Note:

  • Here we are applying dbt on a local Postgres instance.
  • Make sure to be in the dbt project directory
    cd ~/github/dbt-teach-in/dbt_teach_in

Seed

Teach-in file reference:

Teach-in dbt_ documentation reference:

Official dbt_ docs:

  1. Specify a schema to populate the seed files

    • In dbt_project.yml
      seeds:
        dbt_teach_in:
          +schema: seeds
          quote_columns: true
  2. Seed the data

    • Make sure the files you want to seed are in /data, then run:
      dbt seed --profiles-dir ./
  3. Setting custom columns

    • In dbt_project.yml
      seeds:
        dbt_teach_in:
          ...
      
          raw_orders:
            # Override column types
            +column_types:
              order_priority: varchar(1)
  4. Enable/Disable seed files:

    • In dbt_project.yml
      seeds:
        dbt_teach_in:
          +schema: seeds
      
          ...
      
          snapshots:
            +schema: snapshots
      
            raw_orders_snapshot_backup:
              +enabled: true
  5. Seeding specific files:

    • Using --select:
      dbt seed --profiles-dir ./ --select raw_orders_snapshot_backup 

Tests

Official dbt_ docs:

  1. Schema Tests

    1. Creating simple schema tests:

      • In the properties (e.g. schema.yml files), add tests:
      • For example, in /data/seeds.yml (yes, you can do it for seed files too):
        seeds:
          - name: raw_orders
            description: ...
            columns:
              - name: id
                description: Integer ID of an order entry in this table.
                tests:
                  - unique
                  - not_null
        
              - name: order_priority
                tests:
                  - accepted_values:
                      values: ["C", "H", "M", "L"]
        
              ...
    2. Creating relationships tests between tables:

      • In the properties (e.g. schema.yml files), add tests:
      • For example, in /data/seeds.yml:
        seeds:
          - name: raw_orders
            description: ...
            columns:
              ... 
        
              - name: country_id
                description: Integer ID of a country entry in the `raw_countries` table.
                tests:
                  - relationships:
                      to: ref('raw_countries')
                      field: id
        
              - name: item_type_id
                description: Integer ID of an item type entry in the `raw_item_types` table.
                tests:
                  - relationships:
                      to: ref('raw_item_types')
                      field: id
    3. Reusing tests using anchors in yaml:

      • For example, in /models/reporting/staging/reporting_staging.yml:
      • Define an anchor:
        ##############################################
        ## Defining some anchor labels to stay DRY. ##
        ##############################################
        
        test__boolean_true_only: &test__boolean_true_only
          tests:
            - not_null
            - accepted_values:
                values:
                  - true
      • Using the anchor:
        models:
          - name: stg_orders__calc_check
            description: |
              A simple view used to visually check the reported vs. calculated:
                - Revenue
                - Cost
                - Profit
        
              Schema test is executed on the Boolean column and throws error if
              any calculation is off!
        
            columns:
              - name: same_reveue
                description: ""
                <<: *test__boolean_true_only
        
              - name: same_cost
                description: ""
                <<: *test__boolean_true_only
        
              - name: same_profit
                description: ""
                <<: *test__boolean_true_only
    4. But note that schema tests only show whether a test PASSES or FAILS. To get the number of rows that fail a particular test, you can use a data test instead.

  2. Data Tests

    1. Creating simple data test:

      • Create a query that returns rows for a specific condition (in the WHERE clause).
      • If any rows are returned, the test fails and returns an ERROR (default) or WARN (see below)
      • Save the query in the /tests directory
      • See example test_stg_orders__calc_check.sql
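      A data test is just a SELECT that should return zero rows. A sketch along the lines of the example above (the column name is an assumption about that model):

      ```sql
      -- tests/test_stg_orders__calc_check.sql (sketch)
      -- Returns every row where the reported profit disagrees with the
      -- calculated one; any returned row fails the test.
      select *
      from {{ ref('stg_orders__calc_check') }}
      where same_profit = false
      ```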
    2. Defining custom config for each test:

      • By default, a test is enabled and the severity is error.
      • To overwrite this, define custom config in the test's .sql files:
        {{
            config(
                enabled=true | false,
                severity='warn' | 'error'
            )
        }}
      • See example test_stg_orders__calc_check.sql

Snapshots

Use snapshots to track changes to a table over time. You can also use them to view the state of the table at a specific point in time.

  1. Create a snapshot

    1. Follow the official dbt guide
    2. For example, see /snapshots/raw_orders_snapshot.sql
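    A snapshot definition generally follows this shape (the timestamp strategy and the updated_at column are assumptions about the example table):

    ```sql
    {% snapshot raw_orders_snapshot %}

    {{
        config(
          target_schema='snapshots',
          unique_key='id',
          strategy='timestamp',
          updated_at='updated_at',
        )
    }}

    select * from {{ ref('raw_orders') }}

    {% endsnapshot %}
    ```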
  2. To revisit the table at a specific period in time

    1. In our example, there was an error in the "units_sold" in the raw_orders table where the number was doubled in the month of June 2017.
    2. We use a custom query to put ourselves at the point in time before or after the fix was executed (i.e. the fix was on '2017-07-01').
    3. For the custom query, see /analysis/tracing_units_sold_error.sql
    4. Documentation of the Query
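
  dbt adds dbt_valid_from and dbt_valid_to columns to every snapshot, so a point-in-time query like the one in that file generally follows this pattern (schema and table names assumed):

  ```sql
  -- State of raw_orders as of just before the fix on '2017-07-01'
  select *
  from snapshots.raw_orders_snapshot
  where dbt_valid_from < '2017-07-01'
    and (dbt_valid_to >= '2017-07-01' or dbt_valid_to is null)
  ```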

Analyses

Teach-in file reference:

Teach-in dbt_ documentation reference:

Official dbt_ docs:

Basically, if you have "analytical" queries that you want to version control and run using dbt, but without materialising a table, you can put them in the /analysis directory.

  1. Remember, anything in the /analysis directory will not run when you execute dbt run. Instead, it will only be compiled.
  2. Compiling turns the queries in the .sql files into runnable SQL that you can copy and paste into your SQL client or Sisense (Periscope)
  3. To compile manually, just run dbt compile.

Jinja

Official dbt_ docs:

  1. Setting variables in dbt_project.yml:

    • A simple example is to set the variable name test_name to Bob:
      vars:
        dbt_teach_in:
          test_name: "Bob"
    • This variable is available project wide.
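    • Variables declared in dbt_project.yml are read back in models with the var() function:

      ```sql
      select '{{ var("test_name") }}' as name   -- compiles to: select 'Bob' as name
      ```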
  2. Setting variables within .sql scripts

    • A simple example is to set the variable name test_name to Bob:
      {% set test_name = 'Bob' %}
    • This variable is available only in this script.
    • We can even use Python modules within this Jinja context:
  3. Calling variables within .sql scripts.

    • Simply use double curly brackets in .sql files, like this:
      {{ test_name }}
    • For variables assigned to Python modules, you can call their methods or attributes, like this:
      {{ now.minute }}
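    Putting the two together (the modules.datetime context dbt provides is how a Python module ends up in a variable like now):

      ```sql
      {% set test_name = 'Bob' %}
      {% set now = modules.datetime.datetime.now() %}

      select
          '{{ test_name }}' as name,       -- plain variable
          {{ now.minute }}  as run_minute  -- attribute of a Python object
      ```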
  4. Note that ref and source are macros called using Jinja:

    • e.g.
      with source_table_name as (
      
          select * from {{ source('schema_name', 'table_name') }}
      
      )
      
      , model_name as (
      
          select * from {{ ref('model_name') }}
      
      )

Hooks

There are multiple types of hooks & operations dbt can run; see the official dbt documentation

  1. Run Post-Hooks (i.e. after model is materialised)
    • Example:
      {{
          config(
              materialized='table',
              post_hook=[
                'drop table if exists dbt_experiment.crazy_experiment_fruit_only',
                'create table dbt_experiment.crazy_experiment_fruit_only as (select * from dbt_experiment.crazy_experiment where item_type = ''Fruits'')',
              ]
          )
      }}

Macros

Official dbt_ docs:

Documentation

Official dbt_ docs:

  1. Formatting YAML Multiline Strings in descriptions with “>”, “|”, etc.

  2. Using and linking to markdown files with {% docs <name_of_doc> %}

    • For example declaration, see /analysis/docs/analysis.md
    • For example implementation, see /analysis/analysis.yml:
      analyses:
        - name: <name_of_sql>
          description: '{{ doc(''<name_of_doc>'') }}'
    • Note: can be used for seeds, analyses, or models too.
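    A docs block declaration in a markdown file looks like this (the name is a placeholder):

      ```
      {% docs <name_of_doc> %}
      Write the long-form description here, using **markdown** formatting.
      {% enddocs %}
      ```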
  3. Adding Images or Hyperlinks

  4. Formatting Reference

Redshift Configurations

Official dbt_ docs:

  1. Applying to the final fct_ table
    • For example declaration, see /models/reporting/core/fct_orders__dashboard.sql
      -- In `config` here, `sort` and `dist` only work on Redshift, not on Postgres.
      {{
          config(
              materialized='table',
              sort=[
                  'item_type',
                  'order_id',
                  'order_ref_id',
              ],
              dist='order_id',
          )
      }}

Best Practices

About


License: GNU General Public License v3.0

