mattsfuller / dbt-galaxy-covid-demo

dbt + Trino demo project, using Starburst Galaxy

dbt + Trino: Starburst Galaxy COVID-19 tutorial!

There's a fair amount of setup work. The entire value proposition of Trino and Galaxy is being able to query and transform data regardless of where it lives. To demo this, you have to create at least one data store and put data into it. Then you must set up a Galaxy account and give it access to the external data stores as well as to the location where output data will be stored. The silver lining is that you only have to do this once!

For the data source setup required for this tutorial, please see INFRA_SETUP.MD.

This demo works with either dbt Core or dbt Cloud. Both require you to complete the steps in INFRA_SETUP.MD to set up the appropriate data sources.

What you'll need:

Why are we using so many data sources? Well, this data lakehouse tutorial takes you through all the steps of creating a reporting structure, including the steps to get your sources into your land layer in S3. Starburst Galaxy's superpower with dbt is being able to federate data from multiple different sources into one dbt repository. Showing multiple sources demonstrates this use case in addition to the data lakehouse use case. If you are only interested in using S3, you can run all the TPCH and AWS models without creating a Snowflake login. The Snowflake section will fail, but the rest should complete.

You will also need:

  • A dbt installation of your choosing (Core or Cloud).
  • For Core: I used a virtual environment on my M1 Mac because that was the most commonly recommended approach. I'll add the steps below in this readme. Review the other dbt Core installation information to pick what works best for you.
  • For Cloud: I registered for a free account and used this repository in dbt Cloud. This option requires fewer first-time setup steps. If you don't know what to pick, use this.

Tutorial Information

The goal of this tutorial is to showcase the power of dbt and Starburst Galaxy together. It aims to demonstrate two superpowers:

  1. Query federation across multiple data sources - dbt is a transformation tool, so it can only be used after the data has landed in a storage solution. Starburst Galaxy removes that limitation by letting you query your data across multiple sources.
  2. Data lakehouse analytics - In this lab, we are going to build our lakehouse reporting structure in S3, using naming conventions that differ slightly from the traditional Land, Structure, and Consume layers to accommodate dbt standards: Land = Stage, Structure = Intermediate, Consume = Aggregate. For more information about the Starburst data lakehouse, visit this blog.
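
To make the two ideas above a little more concrete, here is a minimal Python sketch. It assumes Trino's three-part `catalog.schema.table` naming and simply builds the text of a cross-catalog join; it does not connect to Galaxy. The catalog, schema, table, and column names (`s3_lake`, `snowflake_cat`, `covid_cases`, `regions`, `region_id`) are hypothetical placeholders, not the names used in this repository.

```python
# Lakehouse layer names and the dbt-style names this tutorial uses for them
# (taken from the list above).
LAYER_TO_DBT = {
    "Land": "Stage",
    "Structure": "Intermediate",
    "Consume": "Aggregate",
}


def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a Trino three-part table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join(left: str, right: str, key: str) -> str:
    """Return SQL text joining two tables that may live in different catalogs."""
    return (
        f"SELECT l.*, r.*\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{key} = r.{key}"
    )


# Hypothetical catalogs: one backed by S3, one backed by Snowflake.
cases = qualified("s3_lake", "stage", "covid_cases")
regions = qualified("snowflake_cat", "public", "regions")

if __name__ == "__main__":
    # Print the query text and the dbt name for the Land layer.
    print(federated_join(cases, regions, "region_id"))
    print(LAYER_TO_DBT["Land"])
```

In a real dbt project the table references would normally come from dbt sources rather than hand-built strings; the sketch only shows that a single query can name tables from more than one catalog.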

dbt Core

For the dbt Core tutorial, visit this blog for more information. Use CORE.MD as the README for running this demo with dbt Core.

dbt Cloud

For the dbt Cloud tutorial, visit this blog for more information. Use CLOUD.MD as the README for running this demo with dbt Cloud.

Shoutouts

Shout out to @dataders for his awesome help! Inspired by the Cinco de Trino repo by @jtcohen6!
