
ETL_with_Pyspark_-_SparkSQL

A sample project designed to demonstrate ETL process using Pyspark & Spark SQL API in Apache Spark.

In this project, I used Apache Spark's PySpark and Spark SQL APIs to implement the ETL process on the data and finally load the transformed data to a destination.
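As a rough illustration of the kind of extract-transform-load step involved (the file paths, column names, and view name below are hypothetical placeholders, not the project's actual data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: read raw data (path and schema are hypothetical)
raw_df = spark.read.option("header", True).csv("/mnt/raw/sales.csv")

# Transform: register a temp view and apply a Spark SQL transformation
raw_df.createOrReplaceTempView("sales_raw")
clean_df = spark.sql("""
    SELECT order_id,
           UPPER(region)          AS region,
           CAST(amount AS DOUBLE) AS amount
    FROM sales_raw
    WHERE amount IS NOT NULL
""")

# Load: write the transformed data to the destination (format/path hypothetical)
clean_df.write.mode("overwrite").parquet("/mnt/curated/sales")
```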

I used Azure Databricks to run the notebooks and to create jobs for them, and Azure Data Factory to create the pipelines that orchestrate the entire workflow.

Note: Any resource deployed in Azure has an associated cost. Users are wholly responsible for creating and deploying resources to Azure, and for any charges that are incurred.

---

main_latest branch:

This branch contains the updated code of the main project found under the main_old branch.

New implementations/changes:

Compared with the code in the main_old branch, the number of notebooks and the lines of code have been reduced: the entire ETL process is now automated through a single generic notebook that performs the transformations on the data (see the sketch below).
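A minimal sketch of how such a generic notebook might be parameterized on Databricks (the widget names and the SQL passed in are hypothetical; the actual notebook in this branch may differ):

```python
# On Databricks, `dbutils` and `spark` are available in every notebook.
# Widgets let a Databricks job or an ADF pipeline pass parameters into the notebook.
dbutils.widgets.text("source_path", "")
dbutils.widgets.text("target_path", "")
dbutils.widgets.text("transform_sql", "")  # SQL run against the 'source' view

source_path = dbutils.widgets.get("source_path")
target_path = dbutils.widgets.get("target_path")
transform_sql = dbutils.widgets.get("transform_sql")

# One generic extract-transform-load path, reused for every dataset
df = spark.read.parquet(source_path)
df.createOrReplaceTempView("source")
spark.sql(transform_sql).write.mode("overwrite").parquet(target_path)
```

Because the source, target, and transformation logic all arrive as parameters, the same notebook can be scheduled for any number of datasets instead of maintaining one notebook per table.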

I will update this README soon with links to the Medium post and YouTube video where I clearly explain the changes I have made to the old notebooks/code.
