
Multi-cloud ETL Pipeline

Objective

  • To run the same ETL code on multiple cloud services, based on your preference.
  • To save the time otherwise spent developing separate ETL scripts for each environment and cloud.

Note

  • This repository currently supports Azure Databricks + AWS Glue.
  • Azure Databricks can't be set up locally; we can only connect a local IDE to a running cluster in Databricks. It works by pushing the code to a GitHub repository and then adding a workflow in Databricks with the URL of the repo & the main file.
  • For AWS Glue, we set up a local environment using the Glue Docker image and then deploy it to AWS Glue using GitHub Actions.
  • The "tasks.txt" file contains the details of the transformations done in the main file; a minimal sketch of how such transformations can be kept cloud-agnostic is shown below.

Pre-requisite

  1. Python 3.7 with pip
  2. AWS CLI configured locally
  3. Java 8 installed.
    # Make sure to export JAVA_HOME like this:
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home

Quick Start

  1. Clone this repo (for Windows use WSL).

  2. To set up the required libraries and packages locally, run:

    # If default SHELL is zsh use
    make setup-glue-local SOURCE_FILE_PATH=~/.zshrc

    # If default SHELL is bash use
    make setup-glue-local SOURCE_FILE_PATH=~/.bashrc
  3. Source your shell profile:
    # For zsh
    source ~/.zshrc

    # For bash
    source ~/.bashrc
  4. Install dependencies:
    make install

Change Your Paths

  1. Enter your S3, ADLS & Kaggle (optional) paths in the app/.custom_env file. This file will be used by Databricks.

  2. Similarly, we'll create a .env file in the root folder. This file will be used by the local Glue job. To create the required files, run:

    make glue-demo-env

This command will copy your paths into the .env file (the sketch after this list shows how a job can read these values).

  3. (Optional) If you want to extract data from Kaggle, enter KAGGLE_KEY & KAGGLE_USERNAME in the .env file only. Note: don't put any sensitive keys in the app/.custom_env file.
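
For reference, here is a minimal sketch of how a job can pick up these values at run time; the key names S3_SOURCE_PATH and ADLS_SOURCE_PATH are hypothetical placeholders and should be replaced with whatever keys you defined in your .env and app/.custom_env files.

    # Minimal sketch: read the configured paths from environment variables.
    # The key names below are hypothetical -- use the ones you defined in
    # .env (local Glue) or app/.custom_env (Databricks).
    import os

    s3_path = os.getenv("S3_SOURCE_PATH")      # e.g. s3://my-bucket/raw/
    adls_path = os.getenv("ADLS_SOURCE_PATH")  # e.g. abfss://container@account.dfs.core.windows.net/raw/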

Setup Check

Finally, check if everything is working correctly by running:

    gluesparksubmit jobs/demo.py

Ensure "Execution Complete" is printed.

Make New Jobs

Write your jobs in the jobs folder. Refer to the jobs/demo.py file; another example is the jobs/main.py file.
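
If you want a bare-bones starting point before copying demo.py, a new job generally needs a GlueContext-backed SparkSession, your transformation steps, and the completion marker checked in the setup step above. The skeleton below is a minimal, hypothetical sketch, not a copy of demo.py:

    # jobs/my_job.py -- hypothetical skeleton for a new job.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    def run() -> None:
        glue_context = GlueContext(SparkContext.getOrCreate())
        spark = glue_context.spark_session

        # Replace this with your own extract / transform / load steps.
        df = spark.createDataFrame([(1, "sample")], ["id", "value"])
        df.show()

        print("Execution Complete")

    if __name__ == "__main__":
        run()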

Deployment

  1. Set up a GitHub Action for AWS Glue. Make sure to add the following secrets to your repository:
    AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY
    S3_BUCKET_NAME
    S3_SCRIPTS_PATH
    AWS_REGION
    AWS_GLUE_ROLE

For all the remaining key-value pairs that you entered in the .env file, make sure to pass them using the automation/deploy_glue_jobs.sh file.

  2. For Azure Databricks, create a workflow with the link to your repo & main file. Pass the following parameters with their correct values (see the sketch below for how they can be read inside the job):
    kaggle_username
    kaggle_token
    storage_account_name
    datalake_access_key
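
Inside the Databricks job these parameters are typically read as widget values. The snippet below is a minimal sketch and assumes it runs on a Databricks cluster, where dbutils and spark are provided by the runtime:

    # Minimal sketch: read the workflow parameters inside the Databricks job.
    # dbutils and spark are injected by the Databricks runtime; this will not
    # run locally.
    kaggle_username = dbutils.widgets.get("kaggle_username")
    kaggle_token = dbutils.widgets.get("kaggle_token")
    storage_account_name = dbutils.widgets.get("storage_account_name")
    datalake_access_key = dbutils.widgets.get("datalake_access_key")

    # Example: let Spark authenticate against ADLS with the data lake key.
    spark.conf.set(
        f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
        datalake_access_key,
    )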

Run Tests & Coverage Report

To run the tests and generate a coverage report, run the following commands in the root folder of the project:

    make test

    # To see the coverage report
    make coverage-report
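
Tests for new jobs can follow the same pattern: build a small DataFrame on a local SparkSession and assert on the transformed output. Below is a minimal, hypothetical pytest sketch; the imported module and function are the illustrative ones from the Note section above, not real files in this repo.

    # tests/test_transformations.py -- hypothetical pytest sketch.
    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

    def test_add_ingestion_date(spark):
        # add_ingestion_date is the hypothetical transformation sketched earlier.
        from transformations import add_ingestion_date

        df = spark.createDataFrame([(1, "sample")], ["id", "value"])
        result = add_ingestion_date(df)

        assert "ingestion_date" in result.columns
        assert result.count() == 1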

References

Glue Programming libraries
