1bk / dbt-teach-in

A repo with examples for a more advanced dbt_ teach-in session


dbt_ Teach-in


Introduction


Introducing (and a refresher of) the more advanced dbt_ features and usage. This repo contains the examples used during the teach-in session.

Video Recording of the Teach-in Session

dbt-teach-in-video

Contents


Quick Links


Requirements


  1. A Database

    • You may either use the Analytics Redshift; or
    • Spin up a local Postgres database instance (using docker)
  2. Python

    • Version 3.6 or above
  3. Git

    • To clone this repository from GitHub
  4. dbt_

  5. For other Python modules (if any), see:

Initial Setup


  1. Prepare the Database

    • If you are using the Analytics Redshift, make sure to

      • Set the profiles.yml in the dbt project accordingly.
        • Set your user
        • Set your schema
        • Change the target to redshift
      • Export your password to the environment variable:
        export PASS='<your_password_here>'
    • If you want to run a local Postgres database instance, you will need to:

      • Install Docker
      • Docker pull and run a local Postgres instance
        docker run --rm --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 postgres
        Tip: add -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data to persist the database.
      • To stop the database when done, run:
        docker stop pg-docker
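
    For reference, a minimal profiles.yml along these lines should cover both setups (the profile name, schema, and Redshift placeholders are assumptions; match them to the `profile:` setting in dbt_project.yml):

    ```yaml
    dbt_teach_in:                        # profile name; must match dbt_project.yml
      target: postgres                   # change to `redshift` when using Redshift
      outputs:
        postgres:
          type: postgres
          host: localhost
          port: 5432
          user: postgres
          pass: docker                   # POSTGRES_PASSWORD from the docker run above
          dbname: postgres
          schema: dbt_teach_in           # assumed target schema
        redshift:
          type: redshift
          host: <redshift_host>
          port: 5439
          user: <your_user>
          pass: "{{ env_var('PASS') }}"  # reads the exported PASS variable
          dbname: <database>
          schema: <your_schema>
    ```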
  2. Installing Python

  3. Cloning this repository

    • If you haven't already, install Git (we recommend installing via Homebrew)
    • Create a directory
      mkdir ~/github
      cd ~/github
    • Clone the repository
      git clone git@github.com:1bk/dbt-teach-in.git

      If you are having trouble with this, please reach out to me.

  4. Installing dbt_

    • While there are many ways to install dbt_, we suggest installing it in a Python virtual environment.
    • Follow the official dbt_ guide that details each step:
      1. Create a local environment
        cd dbt-teach-in                     # set the working directory to the cloned repo
         
        python3 -m venv dbt-env             # create the environment
        source dbt-env/bin/activate         # activate the environment
      2. Install dbt
        pip install dbt

Topics


The topics here are shared in the order in which they were presented.

Note:

  • Here we are applying dbt on a local Postgres instance.
  • Make sure to be in the dbt project directory
    cd ~/github/dbt-teach-in/dbt_teach_in

Seed

Teach-in file reference:

Teach-in dbt_ documentation reference:

Official dbt_ docs:

  1. Specify a schema to populate the seed files

    • In dbt_project.yml
      seeds:
        dbt_teach_in:
          +schema: seeds
          quote_columns: true
  2. Seed the data

    • Make sure the files you want to seed are in /data, then run:
      dbt seed --profiles-dir ./
  3. Setting custom columns

    • In dbt_project.yml
      seeds:
        dbt_teach_in:
          ...
      
          raw_orders:
            # Override column types
            +column_types:
              order_priority: varchar(1)
  4. Enable/Disable seed files:

    • In dbt_project.yml
      seeds:
        dbt_teach_in:
          +schema: seeds
      
          ...
      
          snapshots:
            +schema: snapshots
      
            raw_orders_snapshot_backup:
              +enabled: true
  5. Seeding specific files:

    • Using --select:
      dbt seed --profiles-dir ./ --select raw_orders_snapshot_backup 

Tests

Official dbt_ docs:

  1. Schema Tests

    1. Creating simple schema tests:

      • In the properties (e.g. schema.yml files), add tests:
      • For example, in /data/seeds.yml (yes, you can do it for seed files too):
        seeds:
          - name: raw_orders
            description: ...
            columns:
              - name: id
                description: Integer ID of an order entry in this table.
                tests:
                  - unique
                  - not_null
        
              - name: order_priority
                tests:
                  - accepted_values:
                      values: ["C", "H", "M", "L"]
        
              ...
    2. Creating relationships tests between tables:

      • In the properties (e.g. schema.yml files), add tests:
      • For example, in /data/seeds.yml:
        seeds:
          - name: raw_orders
            description: ...
            columns:
              ... 
        
              - name: country_id
                description: Integer ID of a country entry in the `raw_countries` table.
                tests:
                  - relationships:
                      to: ref('raw_countries')
                      field: id
        
              - name: item_type_id
                description: Integer ID of an item type entry in the `raw_item_types` table.
                tests:
                  - relationships:
                      to: ref('raw_item_types')
                      field: id
    3. Reusing tests using anchors in yaml:

      • For example, in /models/reporting/staging/reporting_staging.yml:
      • Define an anchor:
        ##############################################
        ## Defining some anchor labels to stay DRY. ##
        ##############################################
        
        test__boolean_true_only: &test__boolean_true_only
          tests:
            - not_null
            - accepted_values:
                values:
                  - true
      • Using the anchor:
        models:
          - name: stg_orders__calc_check
            description: |
              A simple view used to visually check the reported vs. calculated:
                - Revenue
                - Cost
                - Profit
        
              Schema test is executed on the Boolean column and throws error if
              any calculation is off!
        
            columns:
              - name: same_reveue
                description: ""
                <<: *test__boolean_true_only
        
              - name: same_cost
                description: ""
                <<: *test__boolean_true_only
        
              - name: same_profit
                description: ""
                <<: *test__boolean_true_only
    4. But note that schema tests only show whether a test PASSES or FAILS. To get the number of rows that fail a particular test, you can use a data test instead.

  2. Data Tests

    1. Creating simple data test:

      • Create a query that returns rows for a specific condition (in the WHERE clause).
      • If any rows are returned, the test fails and returns an ERROR (default) or WARN (see below)
      • Save the query in the /tests directory
      • See example test_stg_orders__calc_check.sql
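      A data test is just a SELECT that should return zero rows. A sketch along the lines of the example above (the column name is an assumption about that model):

      ```sql
      -- tests/test_stg_orders__calc_check.sql (sketch)
      -- Returns every row where the reported profit disagrees with the
      -- calculated one; any returned row fails the test.
      select *
      from {{ ref('stg_orders__calc_check') }}
      where same_profit = false
      ```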
    2. Defining custom config for each test:

      • By default, a test is enabled and the severity is error.
      • To overwrite this, define custom config in the test's .sql files:
        {{
            config(
                enabled=true | false,
                severity='warn' | 'error'
            )
        }}
      • See example test_stg_orders__calc_check.sql

Snapshots

Use snapshots to track changes to a table over time. You can also use them to view the state of the table at a specific point in time.

  1. Create a snapshot

    1. Follow the official dbt guide
    2. For example, see /snapshots/raw_orders_snapshot.sql
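    A snapshot definition generally follows this shape (the timestamp strategy and the updated_at column are assumptions about the example table):

    ```sql
    {% snapshot raw_orders_snapshot %}

    {{
        config(
          target_schema='snapshots',
          unique_key='id',
          strategy='timestamp',
          updated_at='updated_at',
        )
    }}

    select * from {{ ref('raw_orders') }}

    {% endsnapshot %}
    ```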
  2. To revisit the table at a specific period in time

    1. In our example, there was an error in the "units_sold" in the raw_orders table where the number was doubled in the month of June 2017.
    2. We use a custom query to put ourselves at the point in time before or after the fix was executed (i.e. the fix was on '2017-07-01').
    3. For the custom query, see /analysis/tracing_units_sold_error.sql
    4. Documentation of the Query
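
  dbt adds dbt_valid_from and dbt_valid_to columns to every snapshot, so a point-in-time query like the one in that file generally follows this pattern (schema and table names assumed):

  ```sql
  -- State of raw_orders as of just before the fix on '2017-07-01'
  select *
  from snapshots.raw_orders_snapshot
  where dbt_valid_from < '2017-07-01'
    and (dbt_valid_to >= '2017-07-01' or dbt_valid_to is null)
  ```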

Analyses

Teach-in file reference:

Teach-in dbt_ documentation reference:

Official dbt_ docs:

Basically, if you have "analytical" queries that you want to version control and run using dbt, but without materialising a table, you can put them in the /analysis directory.

  1. Remember, anything in the /analysis directory will not run when you execute dbt run. Instead, it will only be compiled.
  2. Compiling turns the queries in the .sql files into runnable SQL that you can copy and paste into your SQL client or Sisense (Periscope)
  3. To compile manually, just run dbt compile.

Jinja

Official dbt_ docs:

  1. Setting variables in dbt_project.yml:

    • A simple example is to set the variable name test_name to Bob:
      vars:
        dbt_teach_in:
          test_name: "Bob"
    • This variable is available project wide.
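    • Variables declared in dbt_project.yml are read back in models with the var() function:

      ```sql
      select '{{ var("test_name") }}' as name   -- compiles to: select 'Bob' as name
      ```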
  2. Setting variables within .sql scripts

    • A simple example is to set the variable name test_name to Bob:
      {% set test_name = 'Bob' %}
    • This variable is available only in this script.
    • We can even use Python modules within this Jinja context:
  3. Calling variables within .sql scripts.

    • Simply use double curly brackets in .sql files, like this:
      {{ test_name }}
    • For variables assigned to Python modules, you can call their methods or attributes, like this:
      {{ now.minute }}
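    Putting the two together (the modules.datetime context dbt provides is how a Python module ends up in a variable like now):

      ```sql
      {% set test_name = 'Bob' %}
      {% set now = modules.datetime.datetime.now() %}

      select
          '{{ test_name }}' as name,       -- plain variable
          {{ now.minute }}  as run_minute  -- attribute of a Python object
      ```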
  4. Note that ref and source are macros called using Jinja:

    • e.g.
      with source_table_name as (
      
          select * from {{ source('schema_name', 'table_name') }}
      
      )
      
      , model_name as (
      
          select * from {{ ref('model_name') }}
      
      )

Hooks

There are multiple types of hooks & operations dbt can run; see the official dbt documentation

  1. Run Post-Hooks (i.e. after model is materialised)
    • Example:
      {{
          config(
              materialized='table',
              post_hook=[
                'drop table if exists dbt_experiment.crazy_experiment_fruit_only',
                'create table dbt_experiment.crazy_experiment_fruit_only as (select * from dbt_experiment.crazy_experiment where item_type = ''Fruits'')',
              ]
          )
      }}

Macros

Official dbt_ docs:

Documentation

Official dbt_ docs:

  1. Formatting YAML Multiline Strings in descriptions with “>”, “|”, etc.

  2. Using and linking to markdown files with {% docs <name_of_doc> %}

    • For example declaration, see /analysis/docs/analysis.md
    • For example implementation, see /analysis/analysis.yml:
      analyses:
        - name: <name_of_sql>
          description: '{{ doc(''<name_of_doc>'') }}'
    • Note: can be used for seeds, analyses, or models too.
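    A docs block declaration in a markdown file looks like this (the name is a placeholder):

      ```
      {% docs <name_of_doc> %}
      Write the long-form description here, using **markdown** formatting.
      {% enddocs %}
      ```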
  3. Adding Images or Hyperlinks

  4. Formatting Reference

Redshift Configurations

Official dbt_ docs:

  1. Applying to the final fct_ table
    • For example declaration, see /models/reporting/core/fct_orders__dashboard.sql
      -- In `config` here, `sort` and `dist` only work on Redshift, not on Postgres.
      {{
          config(
              materialized='table',
              sort=[
                  'item_type',
                  'order_id',
                  'order_ref_id',
              ],
              dist='order_id',
          )
      }}

Best Practices

About


License: GNU General Public License v3.0

