yu-iskw / dbt-artifacts-loader

Load dbt artifacts uploaded to GCS to BigQuery in order to track historical dbt results

Home Page: https://yu-ishikawa.medium.com/data-status-time-machine-on-persisted-dbt-artifacts-a745bc753db3

dbt-artifacts-loader

It enables us to persist dbt artifacts uploaded to GCS into BigQuery, using Cloud Run.

Overview

How to use

  1. We push a Docker image to Google Container Registry.
  2. We apply the terraform resources to launch the Cloud Run application.
  3. We upload dbt artifacts JSON files to Google Cloud Storage.
  4. We create the dbt models to efficiently analyze dbt artifacts.

Requirements

  • docker
  • terraform
  • dbt-bigquery

Push a Docker image to Google Container Registry

Cloud Run only accepts Docker images hosted on Google Cloud, that is, in Google Container Registry or Artifact Registry.

export project="YOUR-PROJECT-ID"
docker build --rm -f Dockerfile -t "gcr.io/${project}/dbt-artifacts-loader:latest" .
docker push "gcr.io/${project}/dbt-artifacts-loader:latest"

Apply the terraform resources to launch the Cloud Run application.

The terraform/example directory enables us to create the required GCP resources and launch the Cloud Run application. It creates:

  • a GCS bucket to store dbt artifacts JSON files,
  • a BigQuery dataset and tables to store dbt artifacts,
  • a Pub/Sub topic and a corresponding Pub/Sub subscription to notify Cloud Run of file uploads (the equivalent gsutil command is sketched after this list), and
  • a Cloud Run application to load uploaded dbt artifacts to BigQuery.
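For reference, the GCS-to-Pub/Sub notification that terraform configures is equivalent to the following gsutil command; the topic name is an assumption, while the bucket name matches the one used later in this README:

# Publish an OBJECT_FINALIZE event in JSON format to the topic whenever a file is uploaded to the bucket.
# The topic name "dbt-artifacts-loader" is an assumption; use the topic created by terraform.
gsutil notification create -t "dbt-artifacts-loader" -f json -e OBJECT_FINALIZE "gs://${project}-dbt-artifacts/"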

The terraform project asks us to specify a GCP project ID in which to allocate the resources and a location for GCS, BigQuery, and Cloud Run. terraform apply prompts us to enter the two values; they can also be passed on the command line, as sketched after the commands below. If you want to fix them, edit the default values in variables.tf.

cd terraform/example
terraform init
terraform apply
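If you prefer to skip the interactive prompt, the two values can also be passed with -var flags. The variable names below are assumptions; check variables.tf for the actual names:

# The variable names "project_id" and "location" are assumptions; see variables.tf.
terraform apply -var="project_id=YOUR-PROJECT-ID" -var="location=US"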

Upload dbt artifacts JSON files to Google Cloud Storage.

We have set up the Cloud Run application to load dbt artifacts JSON files to BigQuery. Now, let's upload a run_results.json to GCS manually.

gsutil cp "target/run_results.json" "gs://${project}-dbt-artifacts/"

The one piece the repository doesn't provide is uploading the dbt artifacts JSON files. I recommend automating the upload of the JSON files to GCS after every dbt execution, for example with a small wrapper script like the one sketched below.
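A minimal sketch of such a wrapper: the bucket name matches the one above, manifest.json is a standard dbt artifact alongside run_results.json, and the dbt command and paths should be adjusted to your project:

#!/usr/bin/env bash
set -euo pipefail

export project="YOUR-PROJECT-ID"

# Run dbt; artifacts such as manifest.json and run_results.json are written to target/.
dbt run

# Upload the generated artifacts to the GCS bucket watched by the Cloud Run application.
gsutil cp "target/manifest.json" "gs://${project}-dbt-artifacts/"
gsutil cp "target/run_results.json" "gs://${project}-dbt-artifacts/"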

Create the dbt models to efficiently analyze dbt artifacts.

We are almost done. The last thing we have to do is create BigQuery tables with dbt, so that we can efficiently analyze dbt artifacts.

To do that, we run:

cd dbt_artifacts_models
# Set up the environment for dbt
make setup
# Create and test the dbt models
make run && make test
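Once the models are built, the history can be queried directly in BigQuery. A sketch, assuming a dataset named dbt_artifacts and a model named run_results (both names are hypothetical; use the dataset and model names defined in dbt_artifacts_models):

# The dataset and table names below are hypothetical; adjust them to the models created above.
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS loaded_run_results FROM `dbt_artifacts.run_results`'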

About

License: Apache License 2.0

