moritzkoerber / covid-19-data-engineering-pipeline

A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via Github Actions.



This repo is my playground for trying out various data engineering tools. The services, tools, and design choices are not always the best fit and are sometimes unnecessarily cumbersome – that simply reflects me exploring different things. At the moment, the pipeline processes Covid-19 data as shown in the architecture diagram below.

(architecture diagram)

All infrastructure is templated in AWS CloudFormation or AWS CDK, and every step features an alarm on failure. The stack can be deployed via GitHub Actions. I use poetry to manage the dependencies and virtual environment.
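As a rough sketch of what "templated in CDK with an alarm on failure" can look like, here is a minimal CDK (Python) stack defining a Glue job and a CloudWatch alarm that notifies an SNS topic when job tasks fail. The job name, script location, and metric settings are illustrative assumptions, not values taken from this repo.

```python
from aws_cdk import (
    Duration,
    Stack,
    aws_cloudwatch as cloudwatch,
    aws_cloudwatch_actions as cw_actions,
    aws_glue as glue,
    aws_iam as iam,
    aws_sns as sns,
)
from constructs import Construct


class CovidPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Role assumed by the Glue job (managed policy used for brevity)
        glue_role = iam.Role(
            self, "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name(
                    "service-role/AWSGlueServiceRole"
                )
            ],
        )

        # PySpark Glue job; name and script location are placeholders
        glue.CfnJob(
            self, "ProcessCovidDataJob",
            name="process-covid-data",
            role=glue_role.role_arn,
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location="s3://my-artifact-bucket/scripts/process_covid_data.py",
            ),
            glue_version="3.0",
        )

        # Alarm on failure: watch Glue's failed-task metric and notify an SNS topic
        alarm_topic = sns.Topic(self, "PipelineAlarmTopic")
        failure_alarm = cloudwatch.Alarm(
            self, "GlueJobFailureAlarm",
            metric=cloudwatch.Metric(
                namespace="Glue",
                metric_name="glue.driver.aggregate.numFailedTasks",
                dimensions_map={
                    "JobName": "process-covid-data",
                    "JobRunId": "ALL",
                    "Type": "count",
                },
                statistic="Sum",
                period=Duration.minutes(5),
            ),
            threshold=1,
            evaluation_periods=1,
            comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
        )
        failure_alarm.add_alarm_action(cw_actions.SnsAction(alarm_topic))
```

In the actual stack the same idea applies to each step: every resource gets a matching alarm so failures surface immediately rather than silently breaking the pipeline.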


License: MIT License


Languages

Python 76.9%, TypeScript 12.3%, Dockerfile 8.6%, JavaScript 1.6%, Shell 0.6%