Eugeme / pyspark-etl-pipeline

Python ETL pipeline with SQL, PySpark, Docker

Case

A retail sales dataset must be ingested into the company's data lake. In addition to the usual transformations, several business rules are applied to produce the final datasets.

Technical requirement

  • raw data is loaded from a CSV file into a PySpark DataFrame
  • the main DataFrame is split into customers and sales DataFrames
  • per-customer metrics, including the quantity of orders in the last 5 days, are calculated and added to the customers DataFrame as separate columns
  • column names are rewritten in camelCase format
  • the date format is changed
  • the final DataFrames are written to Parquet files
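The steps above could be sketched in PySpark roughly as follows. This is a minimal sketch, not the repo's actual code: the file paths, column names (`customer_id`, `order_id`, `order_date`, `amount`, etc.), and the raw date format `dd/MM/yyyy` are all assumptions.

```python
def to_camel_case(name: str) -> str:
    """snake_case -> camelCase, e.g. 'customer_id' -> 'customerId'."""
    head, *rest = name.lower().split("_")
    return head + "".join(word.capitalize() for word in rest)


def run_pipeline(input_csv: str, output_dir: str) -> None:
    # pyspark is imported lazily so the helper above works without Spark installed
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("retail-etl").getOrCreate()

    # 1. raw CSV -> PySpark DataFrame
    raw = spark.read.csv(input_csv, header=True, inferSchema=True)

    # 2. split the main DataFrame into customers and sales
    #    (column names are assumptions, not the repo's real schema)
    customers = raw.select("customer_id", "customer_name", "city") \
                   .dropDuplicates(["customer_id"])
    sales = raw.select("customer_id", "order_id", "order_date", "amount")

    # 3. change the date format: parse the assumed dd/MM/yyyy strings into dates
    sales = sales.withColumn("order_date", F.to_date("order_date", "dd/MM/yyyy"))

    # 4. quantity of orders in the last 5 days per customer, joined onto customers
    max_date = sales.agg(F.max("order_date")).first()[0]
    recent = (sales
              .where(F.datediff(F.lit(max_date), F.col("order_date")) < 5)
              .groupBy("customer_id")
              .agg(F.count("order_id").alias("orders_last_5_days")))
    customers = (customers
                 .join(recent, "customer_id", "left")
                 .fillna({"orders_last_5_days": 0}))

    # 5. rewrite all column names in camelCase
    customers = customers.toDF(*[to_camel_case(c) for c in customers.columns])
    sales = sales.toDF(*[to_camel_case(c) for c in sales.columns])

    # 6. write the final DataFrames to Parquet
    customers.write.mode("overwrite").parquet(f"{output_dir}/customers")
    sales.write.mode("overwrite").parquet(f"{output_dir}/sales")
```

Windowing over the maximum order date (rather than the wall clock) keeps the "last 5 days" rule reproducible on a static sample dataset.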

raw data

(screenshot)

customers dataframe

(screenshot)

sales dataframe

(screenshot)


Languages

Python 96.2%, Dockerfile 3.8%