This project is an ETL (extract, transform, load) data pipeline built with data engineering and CI/CD principles. It uses Python and PostgreSQL to extract CSV files from an AWS S3 bucket, transform the data, and load it into a PostgreSQL database. The transformed data is then exported back to the S3 bucket as CSV files.
For CI/CD, GitHub Actions is used: a Makefile handles Continuous Integration of the code (linting and formatting), while Terraform handles Continuous Delivery, following DevOps best practice.
- Bucket name: `d2b-internal-assessment-bucket`
- Data location: `s3://d2b-internal-assessment-bucket/orders_data/*`
- Database host: `d2b-internal-assessment-dwh.cxeuj0ektqdz.eu-central-1.rds.amazonaws.com`
- Database name: `d2b_assessment`
This project uses the ETL (extract, transform, load) model for data analysis. Three CSV files are downloaded from the AWS S3 bucket using the Python boto3 library and saved into a folder named `import_files` (a sketch of this step is shown after the file list below). The saved CSV files are then loaded into the PostgreSQL database, into a user-specific schema named `nasibell8682_staging`.
NB: `shipment_deliveries` is the actual file name of the file referred to as `shipments_deliveries.csv` in the brief, which stated it incorrectly.
- `orders.csv`: A fact table that contains information about orders placed by customers, including the `order_id`, `customer_id`, `order_date`, `product_id`, `unit_price`, `quantity`, and `amount` fields.
- `reviews.csv`: A fact table that contains customer reviews for products, including the `review` and the `product_id` being reviewed.
- `shipments_deliveries.csv`: A fact table that contains information about shipments and deliveries for orders, including the `shipment_id`, `order_id`, `shipment_date`, and `delivery_date` fields.
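A minimal sketch of the extraction step, assuming the bucket and file names listed above and default AWS credentials; the exact keys and local paths used in `main.py` may differ:

```python
import os

import boto3

BUCKET = "d2b-internal-assessment-bucket"
# File names as described above; shipment_deliveries reflects the NB note.
FILES = ["orders.csv", "reviews.csv", "shipment_deliveries.csv"]


def extract_from_s3(local_dir: str = "import_files") -> None:
    """Download the raw CSV files from the orders_data prefix into import_files/."""
    os.makedirs(local_dir, exist_ok=True)
    s3 = boto3.client("s3")
    for name in FILES:
        s3.download_file(BUCKET, f"orders_data/{name}", os.path.join(local_dir, name))


if __name__ == "__main__":
    extract_from_s3()
```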
The psycopg2 library is used to connect to the database and load the CSV files, creating a table for each file.
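A hedged sketch of the load step using psycopg2's `copy_expert`; the user name, the `DB_PASSWORD` environment variable, and the column types are assumptions for illustration, not the exact DDL used in `main.py`:

```python
import os

import psycopg2

# Connection details from the project brief; the password is assumed to come from an env var.
conn = psycopg2.connect(
    host="d2b-internal-assessment-dwh.cxeuj0ektqdz.eu-central-1.rds.amazonaws.com",
    dbname="d2b_assessment",
    user="nasibell8682",                 # assumed user name, inferred from the schema prefix
    password=os.environ["DB_PASSWORD"],  # assumed env var name
)


def load_orders(csv_path: str = "import_files/orders.csv") -> None:
    """Create a staging table for orders.csv and bulk-load it with COPY."""
    with conn, conn.cursor() as cur:
        # Column types are illustrative; main.py defines the real DDL per file.
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS nasibell8682_staging.orders (
                order_id INT, customer_id INT, order_date DATE,
                product_id VARCHAR, unit_price NUMERIC, quantity INT, amount NUMERIC
            );
            """
        )
        with open(csv_path) as f:
            cur.copy_expert(
                "COPY nasibell8682_staging.orders FROM STDIN WITH CSV HEADER", f
            )
```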
The `if_common` schema contains four dimension tables that are used to enrich the data in the `nasibell8682_staging` schema during the transformation process. These tables are:

- `dim_customers`: Contains information about customers, such as their `customer_id`, `customer_name`, and `postal_code`.
- `dim_dates`: Contains information about dates, such as the `calendar_dt`, `year_num`, `month_of_the_year_num`, `day_of_the_month_num`, `day_of_the_week_num`, and `working_day` fields.
- `dim_addresses`: Contains information about addresses, such as the `country`, `region`, `state`, and `address`.
- `dim_product`: Contains information about products, such as their `product_id`, `product_category`, and `product_name`.
These dimension tables are used together with the fact tables in the `nasibell8682_staging` schema to create the transformed tables in the `nasibell8682_analytics` schema.
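For illustration only, a transformation of this kind (joining a staging fact table to an `if_common` dimension and writing the result into the analytics schema) might look like the query below, executed here through psycopg2. The table name `agg_category_revenue` is hypothetical; the actual queries live in `scripts.sql` and will differ.

```python
import psycopg2

# Hypothetical analytics table: total order revenue per product category.
TRANSFORM_SQL = """
CREATE TABLE IF NOT EXISTS nasibell8682_analytics.agg_category_revenue AS
SELECT p.product_category,
       SUM(o.amount) AS total_revenue
FROM nasibell8682_staging.orders AS o
JOIN if_common.dim_product AS p USING (product_id)
GROUP BY p.product_category;
"""


def run_transform(conn: "psycopg2.extensions.connection") -> None:
    """Run one transformation query against the warehouse and commit it."""
    with conn, conn.cursor() as cur:
        cur.execute(TRANSFORM_SQL)
```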
The repository contains the following files:

- `main.py`: Handles the extraction from the S3 bucket and loading to PostgreSQL.
- `main.tf`: Deploys `export.py` to export the files.
- `export.py`: Handles the exporting of the tables from `nasibell8682_analytics` to the S3 bucket.
- `scripts.sql`: Contains the SQL queries used for transformation.
- `requirements.txt`: Lists the dependencies for the project.
- `Makefile`: Contains a set of commands that can be run to lint and format the `.py` files. To run the `Makefile`, use the command `make <target>`, where `<target>` is the name of the target you want to run. For example, to lint the `.py` files, you can use the command `make lint`.
To run this project, you will need to have the following installed:

- Python
- PostgreSQL
- Terraform
- The dependencies listed in `requirements.txt`

Then:

- Set up your AWS account and PostgreSQL database.
- Install the dependencies using `pip install -r requirements.txt`.
- Run the `Makefile` to lint and format the `.py` files for continuous integration (CI).
- Run `main.py` to extract the CSV files from the S3 bucket and load them into the `nasibell8682_staging` schema in PostgreSQL.
- Run the SQL queries in `scripts.sql` to transform the data and create the tables in the `nasibell8682_analytics` schema.
- Deploy `export.py` using `main.tf` to export the tables in `nasibell8682_analytics` to the S3 bucket (a sketch of the export logic is shown after this list).
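A minimal sketch of what the export step does, assuming the bucket above; the `analytics_export/` output prefix and `export_files` directory are illustrative, not necessarily what `export.py` uses:

```python
import csv
import os

import boto3

BUCKET = "d2b-internal-assessment-bucket"


def export_table(conn, table: str, out_dir: str = "export_files") -> None:
    """Dump one analytics table to CSV and upload it to an assumed analytics_export/ prefix.

    `conn` is an open psycopg2 connection to the warehouse.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{table}.csv")
    with conn.cursor() as cur, open(path, "w", newline="") as f:
        cur.execute(f"SELECT * FROM nasibell8682_analytics.{table};")
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # column names as the header row
        writer.writerows(cur.fetchall())
    boto3.client("s3").upload_file(path, BUCKET, f"analytics_export/{table}.csv")
```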
The code in `main.tf` does not include any provision for an IAM role or policy, which are required to provision successfully with Terraform. Since the assessment user had no permission to create or access roles within the organization's AWS account, a personal AWS account was used to create a role with the required policies, i.e. `AmazonS3FullAccess` and `AmazonRDSFullAccess`. The ARN of the role created in that account was used to create the Lambda function for the deployment. The AWS credentials for the Terraform configuration are stored in GitHub Secrets.
To deploy using the `main.tf` Terraform file, three commands must be run from a CLI:

- `terraform init` prepares the working directory so Terraform can run the configuration.
- `terraform plan` lets you preview any changes before you apply them.
- `terraform apply` executes the changes defined by the Terraform configuration to create, update, or destroy resources.
Terraform was used to deploy `export.py`, while `main.py` was run directly with the CLI command `python main.py`.