agnivchtj / dbt-pipeline-project

Building a data pipeline using DBT and Snowflake to load sample TPCH data and perform basic data modeling techniques, such as building data marts, fact tables, macros and tests.


Building an ELT pipeline with DBT Core and Snowflake

In this project we build an ELT pipeline using DBT and Snowflake. The main objective is to load data from a source (the TPCH_SF1 sample dataset in Snowflake) and apply some basic data modelling techniques, such as building data marts, fact tables, macros and tests.

Project setup

The following resources (a warehouse, database, role and the schema dbt_schema under database dbt_database) were set up in a Snowflake worksheet:

-- initializing resources
use role accountadmin;

create warehouse dbt_warehouse with warehouse_size='x-small';
create database dbt_database;
create role dbt_role;

show grants on warehouse dbt_warehouse;

grant usage on warehouse dbt_warehouse to role dbt_role;
grant all on database dbt_database to role dbt_role;
grant role dbt_role to user agnivchtj;

use role dbt_role;

create schema dbt_database.dbt_schema;

To set up the project we run dbt init and enter details such as the project name (data_pl) and the Snowflake account identifier, as well as the names of the role, warehouse, database and schema created above. Once authentication succeeds, changes made in DBT are reflected in dbt_schema on Snowflake (as tables and views).
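For reference, the connection profile that dbt init writes (typically to ~/.dbt/profiles.yml) looks roughly like the sketch below; the account identifier and authentication values are placeholders and depend on the environment:

# ~/.dbt/profiles.yml (sketch)
data_pl:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>
      user: <snowflake_user>
      password: <password>
      role: dbt_role
      warehouse: dbt_warehouse
      database: dbt_database
      schema: dbt_schema
      threads: 1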

In our dbt_project.yml file we instruct DBT to materialize the models in the staging folder as views in Snowflake, while the models derived from them (under marts) are materialized as tables:

models:
  data_pl:
    # Config indicated by + and applies to all files under models/*/
    staging:
      +materialized: view
      snowflake_warehouse: dbt_warehouse
    marts:
      +materialized: table
      snowflake_warehouse: dbt_warehouse

Then, we specify the data sources that we are using (in models/staging/tpch_sources.yml) and provide details such as name, database, schema and the tables:

version: 2
sources:
  - name: tpch
    database: snowflake_sample_data
    schema: tpch_sf1
    tables:
      - name: orders
        columns:
          - name: o_orderkey
            tests:
              - unique
              - not_null
      - name: lineitem
        columns:
          - name: l_orderkey
            tests:
              - relationships:
                  to: source('tpch', 'orders')
                  field: o_orderkey

Setting up sources

Within the Snowflake sample data, there are two tables that we make use of: orders and lineitem. These are specified as our source, where the column 'l_orderkey' (in the lineitem table) is a foreign key referencing 'o_orderkey' in the orders table. From this source, we can create staging models that contain the fields we need.

For orders, we create a staging model including fields such as 'o_orderkey', 'o_custkey', 'o_orderstatus', 'o_totalprice' etc. For lineitem, we create a staging model comprising fields such as 'l_orderkey', 'l_linenumber', 'l_partkey', 'l_suppkey', 'l_extendedprice', 'l_discount' etc. These models live in the models/staging/ folder.
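As a rough sketch, the orders staging model simply selects the relevant columns from the source defined above (column list abbreviated; the model in the repository may select more fields):

-- models/staging/staging_tpch_orders.sql (sketch, abbreviated)
select
    o_orderkey,
    o_custkey,
    o_orderstatus,
    o_totalprice,
    o_orderdate
from {{ source('tpch', 'orders') }}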

These models can be run and become visible as views in Snowflake:

dbt run -s staging_tpch_orders
dbt run -s staging_tpch_line_items

Applying business transformations to staging tables

Using the staging tables as a reference, we can apply transformations that shape the data for business use. In dimensional modeling, the resulting tables are referred to as 'fact tables'.

For example, int_order_items.sql provides an overview of the line items in each order, with the discounted price computed; that calculation lives as a macro in the macros folder. int_order_items_summary.sql then uses this fact table to produce a summary of total prices per order.
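The discount calculation could be implemented as a macro along the following lines (a sketch only; the macro name, signature and rounding are assumptions, not necessarily what the repository uses):

-- macros/discounted_amount.sql (hypothetical name and signature)
{% macro discounted_amount(extended_price, discount) %}
    round({{ extended_price }} * (1 - {{ discount }}), 2)
{% endmacro %}

Inside int_order_items.sql the macro would then be called on the line item columns, for example as {{ discounted_amount('l_extendedprice', 'l_discount') }}.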

Writing tests

In dbt, there are two ways of defining tests:

  • A singular test is testing in its simplest form, where you test for a single use case: a SQL query that returns the rows that fail the check.
  • A generic test is a parameterized query that accepts arguments and is defined in a special test block (similar to a macro). It can then be applied to models, columns, sources, snapshots etc. (see the sketch after this list).
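For illustration, a generic test that checks a column holds only non-negative values could be defined as follows (a hypothetical example to show the pattern, not necessarily a test used in this project):

-- tests/generic/positive_value.sql (hypothetical file and test name)
{% test positive_value(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} < 0

{% endtest %}

Once defined, it can be attached to a column in a .yml file like any built-in test.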

We wrote tests to check for the following conditions:

  • Discounts are valid: verified by checking that no row has a total price lower than its discounted price.
  • Valid order date: verified by checking that no order date is in the future or before 1990.

Singular tests can be run simply with the dbt test command.
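As a sketch, the order-date check could be written as a singular test: a SQL file under tests/ that selects the rows violating the rule (the file name, referenced model and column below are assumptions; the repository's version may differ):

-- tests/valid_order_date.sql (hypothetical file name)
-- returns the failing rows: any order dated in the future or before 1990
select *
from {{ ref('staging_tpch_orders') }}
where o_orderdate > current_date()
   or o_orderdate < '1990-01-01'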
