johnkdunyo / lmd-aws-cdk-pipelines-datalake-etl

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

CDK Pipelines for Data Lake ETL Deployment

This solution helps you deploy ETL processes on data lake using AWS CDK Pipelines. This is based on AWS blog Deploy data lake ETL jobs using CDK Pipelines. We recommend you to read the blog before you proceed with the solution.

CDK Pipelines is a construct library module for painless continuous delivery of CDK applications. CDK stands for Cloud Development Kit. It is an open source software development framework to define your cloud application resources using familiar programming languages.

This solution helps you to:

  1. deploy ETL jobs on data lake
  2. build CDK applications for your ETL workloads
  3. deploy ETL jobs from a central deployment account to multiple AWS environments such as dev, test, and prod
  4. leverage the benefit of self-mutating feature of CDK Pipelines. For example, whenever you check your CDK app's source code in to your version control system, CDK Pipelines can automatically build, test, and deploy your new version
  5. increase the speed of prototyping, testing, and deployment of new ETL jobs

Contents

Data lake

In this section we talk about Data lake architecture and its infrastructure.


Architecture

To level set, let us design a data lake. As shown in the figure below, we use Amazon S3 for storage. We use three S3 buckets - 1) raw bucket to store raw data in its original format 2) conformed bucket to store the data that meets the quality requirements of the lake 3) purpose-built data that is used by analysts and data consumers of the lake.

The Data Lake has one producer which ingests files into the raw bucket. We use AWS Lambda and AWS Step Functions for orchestration and scheduling of ETL workloads.

We use AWS Glue for ETL and data cataloging, Amazon Athena for interactive queries and analysis. We use various AWS services for logging, monitoring, security, authentication, authorization, notification, build, and deployment.

Note: AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. These two services are not used in this solution.

Conceptual Data Lake


Infrastructure

Now we have the Data Lake design, let's deploy its infrastructure. You can use AWS CDK Pipelines for Data Lake Infrastructure Deployment for this purpose.


ETL use case

To demonstrate the above benefits, we will use NYC Taxi and Limousine Commission Data and build a sample ETL process for this. In our Data Lake, we have three S3 buckets - Raw, Conformed, and Purpose-built.

Figure below represents the infrastructure resources we provision for Data Lake.

  1. Data in json(via S3 put method), csv, etc. is uploaded to the raw S3 bucket from a data producer/source. The data producer could be a file server, ETL processes integrating data from external sources, manual files uploads, etc.
  2. S3 event-notification(via S3 put event) triggers a lambda function.
  3. The AWS Lambda function inserts an item in DynamoDB table.
  4. AWS Lambda function Starts an execution of AWS Step Functions State machine
  5. Step functions intiates the raw_to_staging_job which converts the uploaded data into parquet format – Initiate glue job in sync mode
  6. The raw_to_staging_job loads the formatted data into the respective S3 bucket.
  7. The raw_to_staging_job updates the data catalogue with the new data
  8. Step functions intiates the staging_to_curated_job which transforms the parquet data using the SQL scripts provided for that dataset
  9. The staging_to_curated_job writes to S3 with the tranformed data
  10. The staging_to_curated_job updates the catalogue with the details about the transformed data
  11. Step functions intiates the curated_to_redshift_job which is responsible for loading the curated data into AWS redshift
  12. The curated_to_redshift_job reads the S3 bucket via the updated catalogue
  13. The curated_to_redshift_job loads data into redshift after deduplicating it first
  14. Dynamo DB is updated with the results of the step function run
  15. Quicksight can be connected directly to the warehouse and visualisations done.
  16. An SNS notification via email is sent to the data engineering team channel on the status of the data pipeline

Data Lake Infrastructure Architecture


The solution

We use a centralized deployment model to deploy data lake infrastructure across dev, test, and prod environments.


Centralized deployment

Let us see how we deploy data lake ETL workloads from a central deployment account to multiple AWS environments such as dev, test, and prod. As shown in the figure below, we organize Data Lake ETL source code into three branches - dev, test, and production. We use a dedicated AWS account to create CDK Pipelines. Each branch is mapped to a CDK pipeline and it turn mapped to a target environment. This way, code changes made to the branches are deployed iteratively to their respective target environment.

Alt


Continuous delivery of ETL jobs using CDK Pipelines

Figure below illustrates the continuous delivery of ETL jobs on Data Lake.

Alt

There are few interesting details to point out here:

  1. The DevOps administrator checks in the code to the repository.
  2. The DevOps administrator (with elevated access) facilitates a one-time manual deployment on a target environment. Elevated access includes administrative privileges on the central deployment account and target AWS environments.
  3. CodePipeline periodically listens to commit events on the source code repositories. This is the self-mutating nature of CodePipeline. It’s configured to work with and is able to update itself according to the provided definition.
  4. Code changes made to the main branch of the repo are automatically deployed to the dev environment of the data lake.
  5. Code changes to the test branch of the repo are automatically deployed to the test environment.
  6. Code changes to the prod branch of the repo are automatically deployed to the prod environment.

Source code structure

Table below explains how this source ode structured:

File / Folder Description
app.py Application entry point
pipeline_stack Pipeline stack entry point
pipeline_deploy_stage Pipeline deploy stage entry point
glue_stack Stack creates Glue Jobs and supporting resources such as Connections, S3 Buckets - script and temporary - and an IAM execution Role
step_functions_stack Stack creates an ETL State machine which invokes Glue Jobs and supporting Lambdas - state machine trigger and status notification.
dynamodb_stack Stack creates DynamoDB Tables for Job Auditing and ETL transformation rules.
Glue Scripts Glue spark job data processing logic for conform and purpose built layers
ETL Job Auditor lambda script to update dynamodb in case of glue job success or failure
ETL Trigger lambda script to trigger step function and initiate dynamodb
ETL Transformation SQL Transformation logic to be used for data processing from conformed to purpose-built
Resources This folder has architecture and process flow diagrams

Deployment

This section provides deployment instructions.


Setup infrastructure and bootstrap AWS accounts

This project is dependent on the AWS CDK Pipelines for Data Lake Infrastructure Deployment. Please reference the Prerequisites section in README.

Deploying for the first time

Configure your AWS profile to target the central Deployment account as an Administrator and perform the following steps:

  1. Open command line (terminal)

  2. Go to project root directory where cdk.json and app.py exist

  3. Run the command cdk ls

  4. Expected output. You will see the following CloudFormation stack names listed on your terminal

    DevDataLakeCDKBlogEtlPipeline
    ProdDataLakeCDKBlogEtlPipeline
    TestDataLakeCDKBlogEtlPipeline
    DevDataLakeCDKBlogEtlPipeline/Dev/DevDataLakeCDKBlogEtlDynamoDb
    DevDataLakeCDKBlogEtlPipeline/Dev/DevDataLakeCDKBlogEtlGlue
    DevDataLakeCDKBlogEtlPipeline/Dev/DevDataLakeCDKBlogEtlStepFunctions
    ProdDataLakeCDKBlogEtlPipeline/Prod/ProdDataLakeCDKBlogEtlDynamoDb
    ProdDataLakeCDKBlogEtlPipeline/Prod/ProdDataLakeCDKBlogEtlGlue
    ProdDataLakeCDKBlogEtlPipeline/Prod/ProdDataLakeCDKBlogEtlStepFunctions
    TestDataLakeCDKBlogEtlPipeline/Test/TestDataLakeCDKBlogEtlDynamoDb
    TestDataLakeCDKBlogEtlPipeline/Test/TestDataLakeCDKBlogEtlGlue
    TestDataLakeCDKBlogEtlPipeline/Test/TestDataLakeCDKBlogEtlStepFunctions
  5. Before you bootstrap central deployment account account, set environment variable

    export AWS_PROFILE=replace_it_with_deployment_account_profile_name_b4_running
  6. Run the command cdk deploy --all

  7. Expected outputs:

    1. In deployment account, the following CodePipelines created successfully

      Alt

    2. In Dev environment's CloudFormation console, the following stacks created successfully

      Alt


Iterative Deployment

Pipeline you have created using CDK Pipelines module is self mutating. That means, code checked to GitHub repository branch will kick off CDK Pipeline mapped to that branch.


Testing

This section provides testing instructions.


Prerequisites

Below lists steps are required before starting the job testing:

  1. Note: We use New York City TLC Trip Record Data.

  2. Download Yellow Taxi Trip Records for August-2020

  3. Make sure the transformation logic is entered in dynamodb for <> table. As part of job creation mentioned transformation logic will be used to transform data from raw to conform:

    SELECT count(*) count, coalesce(vendorid,-1) vendorid, day, month, year, pulocationid, dolocationid, payment_type, sum(passenger_count) passenger_count, sum(trip_distance) total_trip_distance, sum(fare_amount) total_fare_amount, sum(extra) total_extra, sum(tip_amount) total_tip_amount, sum(tolls_amount) total_tolls_amount, sum(total_amount) total_amount
    FROM datalake_raw_source.yellow_taxi_trip_record
    GROUP BY vendorid, day, month, year, day, month, year, pulocationid, dolocationid, payment_type;
  4. Create a folder under raw bucket {target_environment.lower()}-{resource_name_prefix}-{self.account}-{self.region}-raw root path, this folder name will be used as source_system_name. You can use tlc_taxi_data or name of your choice.

  5. Go to the created folder and create child folder named yellow_taxi_trip_record or you can name it per your choice

  6. Configure Athena workgroup before you run queries via Amazon Athena. For more details, refer Setting up Athena Workgroups.


Steps for ETL testing

  1. Go to raw S3 bucket and perform the following steps:

    1. create a folder with name tlc_taxi_data and go to it
    2. create a folder with name yellow_taxi_trip_record and go to it
    3. upload the file yellow_tripdata_2020-01.csv
  2. Upon successful load of file S3 event notification will trigger the lambda

  3. Lambda will insert record into the dynamodb table {target_environment.lower()}-{resource_name_prefix}-etl-job-audit to track job start status

  4. Lambda function will trigger the step function. Step function name will be <filename>-<YYYYMMDDHHMMSSxxxxxx> and provided the required metadata input

  5. Step functions state machine will trigger the Glue job for Raw to Conformed data processing.

  6. Glue job will load the data into conformed bucket using the provided metadata and data will be loaded to s3://{target_environment.lower()}-{resource_name_prefix}-{self.account}-{self.region}-conformed/tlc_taxi_data/yellow_taxi_trip_record/year=YYYY/month=MM/day=DD in parquet format

  7. Glue job will create/update the catalog table using the tablename passed as parameter based on folder name yellow_taxi_trip_record as being mentioned in prerequisites

  8. After raw to conform job completion purpose-built glue job will get triggered in step function

  9. Purpose built glue job will use the transformation logic being provided in dynamodb as part of prerequisites for data transformation

  10. Purpose built glue job will store the result set in S3 bucket under s3://{target_environment.lower()}-{resource_name_prefix}-{self.account}-{self.region}-purposebuilt/tlc_taxi_data/yellow_taxi_trip_record/year=YYYY/month=MM/day=DD

  11. Purpose built glue job will create/update the catalog table

  12. After completion of glue job lambda will get triggered in step function to update the dynamodb table {target_environment.lower()}-{resource_name_prefix}-etl-job-audit with latest status

  13. SNS notification will be sent to the subscribed users

  14. To validate the data, please open Athena service and execute query. For testing purpose below mentioned query is being used

    SELECT * FROM "datablog_arg"."yellow_taxi_trip_record" limit 20;
  15. For testing of second data source, Download Green Taxi Trip Records for August-2020

  16. Perform the prerequisites for second source, where create child folder yellow_taxi_trip_record under could be tlc_taxi_data in s3://{target_environment.lower()}-{resource_name_prefix}-{self.account}-{self.region}-raw

  17. For dynamodb transformation logic you can use the below mentioned query:

    SELECT count(*) count, coalesce(vendorid,-1) vendorid, day, month, year, pulocationid, dolocationid, payment_type, sum(passenger_count) passenger_count, sum(trip_distance) total_trip_distance, sum(fare_amount) total_fare_amount, sum(extra) total_extra, sum(tip_amount) total_tip_amount, sum(tolls_amount) total_tolls_amount, sum(total_amount) total_amount
    FROM datalake_raw_source.green_taxi_record_data
    GROUP BY vendorid, day, month, year, day, month, year, pulocationid, dolocationid, payment_type

Additional resources

In this section, we provide some additional resources.


Clean up

  1. Delete stacks using the command cdk destroy --all. When you see the following text, enter y, and press enter/return.

    Are you sure you want to delete: TestDataLakeCDKBlogEtlPipeline, ProdDataLakeCDKBlogEtlPipeline, DevDataLakeCDKBlogEtlPipeline (y/n)?

    Note: This operation deletes stacks only in central deployment account

  2. To delete stacks in development account, log onto Dev account, go to AWS CloudFormation console and delete the following stacks:

    1. Dev-DevDataLakeCDKBlogEtlDynamoDb
    2. Dev-DevDataLakeCDKBlogEtlGlue
    3. Dev-DevDataLakeCDKBlogEtlStepFunctions
  3. To delete stacks in test account, log onto Dev account, go to AWS CloudFormation console and delete the following stacks:

    1. Test-TestDataLakeCDKBlogEtlDynamoDb
    2. Test-TestDataLakeCDKBlogEtlGlue
    3. Test-TestDataLakeCDKBlogEtlStepFunctions
  4. To delete stacks in prod account, log onto Dev account, go to AWS CloudFormation console and delete the following stacks:

    1. Prod-ProdDataLakeCDKBlogEtlDynamoDb
    2. Prod-ProdDataLakeCDKBlogEtlGlue
    3. Prod-ProdDataLakeCDKBlogEtlStepFunctions
  5. For more details refer to AWS CDK Toolkit


AWS CDK

Refer to cdk_instructions.md for detailed instructions


Developer guide

Refer to Developer Guide for more information on this project


Authors

The following people are involved in the design, architecture, development, and testing of this solution:

  1. Isaiah Grant, Cloud Consultant, 2nd Watch, Inc.
  2. Muhammad Zahid Ali, Data Architect, Amazon Web Services
  3. Ravi Itha, Senior Data Architect, Amazon Web Services

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

About

License:MIT No Attribution


Languages

Language:Python 99.6%Language:Batchfile 0.4%