Yan-Luo-AU / Data_Engineer_Project_ETL_BI

This is an ETL project - extracting data from an ecommerce transactional database on RDS, transforming the data using AWS glue job, and loading it to a Redshift data warehouse, and connected it to Tableau for BI

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Data Engineer Project - Syn transactional data to data warehouse for BI

  • Create MySQL database instance on RDS as transactional database
  • Create a Data Warehouse in Redshift, and connect it to DBeaver
  • AWS DataPipeline: one-time load historical data into Redshift tables using copy command
  • AWS Glue: incremental data loads into Redshift
    • use AWS Secret Managers to store credentials
    • create a schema for staging tables in Redshift, and create staging tables
    • copy current data from S3 and store into a staging table
    • delete from the main table in the Data Warehouse using the staging tables
    • insert into the main table in the Data Warehouse by selecting everything from the staging table, and truncate the staging table at the end of this process.
    • use Lambda function to trigger the Glue job
  • Data Enrichment & Centralisation
    • User behaviour/Funnel data from external sources that is semi-structured can be stored in S3 (data lake)
    • use AWs Glue Crawler and Athena to read and query the data
    • the AWS Glue Crawler read the timestamp column into string format so a Glue job in pySpark is used to convert it into timestamp and also partition the timestamp into year and month
    • Add a Lambda function to trigger the Glue job so when the data arrives in S3 the ETL process can be triggered.
    • Imported the external data into Redshift
      • method1: create external schema from data catalog
      • method2: create external schema directly from s3
    • External data can be joined with internal data in Redshift
  • Connect Redshift to Tableau for data visualization and BI

About

This is an ETL project - extracting data from an ecommerce transactional database on RDS, transforming the data using AWS glue job, and loading it to a Redshift data warehouse, and connected it to Tableau for BI


Languages

Language:Python 100.0%