alexott / dlt-files-in-repos-demo

Demonstration of using Files in Repos with Databricks Delta Live Tables

This repository contains a demo of using Files in Repos functionality with Databricks Delta Live Tables (DLT) to perform unit & integration testing of DLT pipelines.

The development workflow

The development workflow is organized as shown in the following image:

DLT development workflow

A more detailed description is available in the blog post Applying software development & DevOps best practices to Delta Live Table pipelines.

Setup instructions

🚧 Work in progress...

⚠️ The setup instructions describe the process of performing CI/CD using Azure DevOps (ADO), but a similar setup could be implemented with any CI/CD technology.

There are two ways to set everything up:

  1. using Terraform - the easiest way to get everything configured in a short time. Just follow the instructions in the terraform/azuredevops/ folder. ⚠️ This doesn't include creation of the release pipeline, as there is no REST API or Terraform resource for it.
  2. manually - follow the instructions below to create all necessary objects.

Create necessary Databricks Repos checkouts

In this example we're using three checkouts of our sample repository:

  1. Development: used for the actual development of new code, running tests before committing the code, etc.
  2. Staging: used to run tests on commits to branches and/or pull requests. This checkout will be updated to the branch to which the commit happened. We're using a single checkout just for simplicity, but in real life such checkouts should be created automatically so that multiple tests can run in parallel.
  3. Production: used to keep the production code. This checkout will always be on the releases branch and will be updated only when a commit happens to that branch and all tests pass.

Here is an example of repos created with Terraform:

Databricks repos
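If you script this step instead of using Terraform or the UI, the checkouts can be created with the Databricks Repos REST API. The snippet below is a minimal sketch in Python: the repository URL, workspace host, target paths, and branch names are placeholders for illustration, not values taken from this repository.

```python
# Minimal sketch: create the three Repos checkouts via the Databricks Repos REST API.
# Host, token, repo URL, paths and branches are placeholders - adjust to your setup.
import os
import requests

host = os.environ["DATABRICKS_HOST"]          # e.g. https://adb-....azuredatabricks.net
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

repo_url = "https://github.com/alexott/dlt-files-in-repos-demo"  # sample repository

# One checkout per environment; paths and branches are illustrative.
checkouts = {
    "/Repos/Development/dlt-files-in-repos-demo": "my-feature-branch",
    "/Repos/Staging/dlt-files-in-repos-demo": "main",
    "/Repos/Production/dlt-files-in-repos-demo": "releases",
}

for path, branch in checkouts.items():
    # Create the checkout.
    resp = requests.post(
        f"{host}/api/2.0/repos",
        headers=headers,
        json={"url": repo_url, "provider": "gitHub", "path": path},
    )
    resp.raise_for_status()
    repo_id = resp.json()["id"]
    # Switch it to the desired branch - this is also how the staging checkout
    # would be updated to the branch that triggered a CI run.
    requests.patch(
        f"{host}/api/2.0/repos/{repo_id}",
        headers=headers,
        json={"branch": branch},
    ).raise_for_status()
```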

Create DLT pipelines

We need to create a few DLT pipelines for our work:

  1. for the main code used for development - use only the pipelines/DLT-Pipeline.py notebook from the development repository.
  2. (optional) for the integration test that could be run as part of development - from the development repository, use the main code notebook (pipelines/DLT-Pipeline.py) together with the integration test notebook (tests/integration/DLT-Pipeline-Test.py).
  3. for the integration test running as part of the CI/CD pipeline - similar to the previous item, but use the staging repository.
  4. for the production pipeline - use only the pipelines/DLT-Pipeline.py notebook from the production repository.

Here is an example of pipelines created with Terraform:

DLT pipelines
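The actual notebooks live in pipelines/ and tests/integration/ of this repository; the sketch below only illustrates the general pattern such a pair of notebooks follows, with hypothetical table names and a placeholder input path. The main notebook defines the pipeline's tables, while the integration-test notebook is attached to the same pipeline and uses DLT expectations to fail the update if the data is wrong.

```python
# Illustrative sketch only - not the actual contents of pipelines/DLT-Pipeline.py
# or tests/integration/DLT-Pipeline-Test.py. Table names and paths are hypothetical.
import dlt
from pyspark.sql import functions as F

# Main pipeline notebook style: define the tables of the pipeline.
@dlt.table(comment="Raw input data")
def raw_data():
    # `spark` is provided by the DLT notebook context; the path is a placeholder.
    return spark.read.format("json").load("/tmp/dlt-demo-input")

@dlt.table(comment="Cleaned data")
def clean_data():
    return dlt.read("raw_data").where(F.col("id").isNotNull())

# Integration-test notebook style: check the produced tables with expectations
# that fail the pipeline update when violated.
@dlt.table(comment="Integration test: clean_data must not be empty")
@dlt.expect_or_fail("row_count_positive", "cnt > 0")
def test_clean_data_not_empty():
    return dlt.read("clean_data").agg(F.count("*").alias("cnt"))
```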

Create Databricks cluster

If you decide to run the test notebooks located in the tests/unit-notebooks directory, you will need to create a Databricks cluster that will be used by the Nutter library. To speed up the tests, attach the nutter & chispa libraries to the created cluster.

If you don't want to run these tests, comment out the block with displayName "Execute Nutter tests" in azure-pipelines.yml.
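For reference, a Nutter test notebook generally follows the fixture pattern sketched below. The test class, test name, and the transformation under test are hypothetical; chispa's assert_df_equality is used for the DataFrame comparison.

```python
# Sketch of a Nutter test notebook (hypothetical test subject).
# nutter & chispa must be available on the cluster running these tests.
from runtime.nutterfixture import NutterFixture
from chispa import assert_df_equality

class SampleTransformTest(NutterFixture):
    def run_uppercase_names(self):
        # Arrange & act: build input, expected and actual DataFrames.
        # The transformation is inlined here purely for illustration.
        self.expected_df = spark.createDataFrame([("ALICE",), ("BOB",)], ["name"])
        input_df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
        self.actual_df = input_df.selectExpr("upper(name) as name")

    def assertion_uppercase_names(self):
        # chispa raises if the two DataFrames differ.
        assert_df_equality(self.actual_df, self.expected_df)

result = SampleTransformTest().execute_tests()
print(result.to_string())
# In CI, result.exit(dbutils) reports the outcome back to the Nutter CLI.
```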

Create ADO build pipeline

🚧 Work in progress...

The ADO build pipeline consists of two stages:

  • onPush is executed on a push to any Git branch except the releases branch and version tags. This stage only runs & reports unit test results (both local & notebook-based).
  • onRelease is executed only on commits to the releases branch, and in addition to the unit tests it will execute a DLT pipeline with the integration test (see the image and the sketch below).

Stages of ADO build pipeline
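The onRelease stage has to start the DLT pipeline that contains the integration test and wait for it to finish. Conceptually that boils down to calling the Pipelines REST API, roughly as in the Python sketch below; the pipeline ID, host, and token are placeholders supplied by the CI environment, and the actual build pipeline in this repository may use a different mechanism.

```python
# Sketch: trigger a DLT pipeline update and wait for it via the Pipelines REST API.
# Host, token and pipeline ID are placeholders provided by the CI environment.
import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
pipeline_id = os.environ["DLT_PIPELINE_ID"]  # staging pipeline with the integration test

# Trigger an update (full refresh so the integration test runs on fresh data).
start = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers=headers,
    json={"full_refresh": True},
)
start.raise_for_status()
update_id = start.json()["update_id"]

# Poll until the update reaches a terminal state.
while True:
    status = requests.get(
        f"{host}/api/2.0/pipelines/{pipeline_id}/updates/{update_id}",
        headers=headers,
    )
    status.raise_for_status()
    state = status.json()["update"]["state"]
    if state in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(30)

# Fail the CI step if the pipeline (and thus the integration test) didn't complete.
if state != "COMPLETED":
    raise SystemExit(f"DLT pipeline update finished in state {state}")
```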

Create ADO release pipeline

🚧 Work in progress...
