
An automated data pipeline using Apache Airflow that performs ETL on raw data with the Pandas library, stages the data in PostgreSQL, processes it in parallel on a distributed Spark cluster, and loads the final, useful data into an Elasticsearch NoSQL data warehouse


EScooter-Data-Pipeline

Technology Stack:

  • Apache Airflow
  • Python
  • Pandas
  • Spark
  • PostgreSQL (Relational)
  • Elasticsearch, Kibana (NoSQL)
  • Files (CSV, JSON)
  • OS (macOS)

Summary

This is an automated data pipeline workflow that performs ETL (Extract, Transform, Load) operations in Python with the help of the Pandas library, where the data is processed in parallel in a distributed environment using the Spark data processing engine

The relational database PostgreSQL acts as the staging database and the NoSQL database Elasticsearch acts as the data warehouse, while CSV and JSON data files serve as the multiple data sources

The data pipeline management tool Apache Airflow is used to create and manage the automated data workflow efficiently, with the help of a DAG and Python/Bash operators
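
As a rough illustration of these roles, the sketch below opens the two connections a pipeline task would need; the host names, credentials, and database/index names are placeholders, not this repository's actual configuration.

    # Minimal sketch of the two storage endpoints described above.
    # Host, credentials, and database name are placeholders.
    import psycopg2                          # PostgreSQL staging database client
    from elasticsearch import Elasticsearch  # Elasticsearch warehouse client

    def get_staging_connection():
        """Connect to the PostgreSQL staging database."""
        return psycopg2.connect(
            host="localhost",
            dbname="escooter_staging",   # placeholder database name
            user="airflow",              # placeholder credentials
            password="airflow",
        )

    def get_warehouse_client():
        """Connect to the Elasticsearch warehouse instance."""
        return Elasticsearch("http://localhost:9200")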

Problem

We have transactional raw data for an e-scooter organization, with a huge number of records available in the data lake. These transactions need to be stored in a data warehouse for Business Intelligence. Since this is raw data, it may be dirty and flawed, so we need to make sure only useful data lands in the warehouse via the data pipeline. The pipeline should be fully automated so that it can run daily and load data into the warehouse.

Solution

In order to achieve an automated data pipeline that loads the transactional data into the warehouse:

  1. First, clean the data by performing ETL operations with the help of the Pandas library and put the cleaned data into the staging database, which is PostgreSQL in this case (see the sketch after this list).
  2. Once the data is loaded successfully into the staging DB, process and validate it in a distributed, parallel way with the help of the PySpark library, using a single master node and a single worker node; transform and enrich the data by adding new useful fields and removing duplicate/unnecessary records.
  3. After that, load the processed data into the warehouse, which is Elasticsearch in this case, creating different indices as per requirement.
  4. Once all of this is done, clean the data directory and copy the staging files (CSV, JSON) into the final directory.
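
A minimal sketch of step 1, assuming hypothetical file paths, column names, and table names (the real sources and schema live in the pipeline script):

    # Step 1 sketch: clean the raw CSV/JSON sources with Pandas and stage the
    # result in PostgreSQL. Paths, columns, and table names are illustrative.
    import pandas as pd
    from sqlalchemy import create_engine

    def extract_and_stage():
        # Extract: read the raw data sources (placeholder paths)
        rides = pd.read_csv("data/raw/escooter_rides.csv")
        weather = pd.read_json("data/raw/weather.json")

        # Transform: basic cleaning of the raw records
        rides = rides.drop_duplicates()
        rides = rides.dropna(subset=["ride_id"])   # drop rows missing the key field
        rides["start_time"] = pd.to_datetime(rides["start_time"], errors="coerce")

        # Load: stage the cleaned frames into the PostgreSQL staging database
        engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/escooter_staging")
        rides.to_sql("staging_rides", engine, if_exists="replace", index=False)
        weather.to_sql("staging_weather", engine, if_exists="replace", index=False)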

All of the above steps are executed one after another, efficiently, with the help of Apache Airflow: we create a DAG and a corresponding Python function for each step, where each task performs a single step in order with the help of the PythonOperator, as sketched below.
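
A minimal sketch of that DAG wiring, assuming Airflow 2.x and using stubbed task callables; only the DAG id matches the pipeline name used in this repository:

    # DAG sketch: one PythonOperator per pipeline step, run strictly in order.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical step callables; in the real DAG these would be the ETL,
    # Spark-processing, Elasticsearch-loading, and file-archiving functions.
    def extract_and_stage(): ...
    def process_with_spark(): ...
    def load_to_elasticsearch(): ...
    def archive_source_files(): ...

    with DAG(
        dag_id="e-scooterSparkPipeline",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",      # run the pipeline once a day
        catchup=False,
    ) as dag:
        stage = PythonOperator(task_id="extract_and_stage", python_callable=extract_and_stage)
        process = PythonOperator(task_id="process_with_spark", python_callable=process_with_spark)
        load = PythonOperator(task_id="load_to_elasticsearch", python_callable=load_to_elasticsearch)
        archive = PythonOperator(task_id="archive_source_files", python_callable=archive_source_files)

        # Enforce the order: stage, then Spark processing, then warehouse load, then cleanup
        stage >> process >> load >> archive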

The final result is an automated pipeline that performs ETL with the help of Pandas, loads the data into the PostgreSQL staging database, processes it in a distributed way using Spark, and finally loads the clean, useful data into the Elasticsearch warehouse.
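
A hedged sketch of the Spark processing and warehouse-load steps, assuming the staged table, index name, derived column, and JDBC connection details shown below (the PostgreSQL JDBC driver must be on the Spark classpath); the actual pipeline script may load into Elasticsearch differently:

    # Steps 2-3 sketch: read the staged table with PySpark, enrich and
    # de-duplicate it on the cluster, then index the result into Elasticsearch.
    # JDBC URL, table/index names, and the derived column are placeholders.
    from pyspark.sql import SparkSession, functions as F
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    def process_with_spark_and_load():
        spark = (SparkSession.builder
                 .master("spark://Sankets-MacBook-Air.local:7077")  # master started in the execution steps
                 .appName("escooter-pipeline")
                 .getOrCreate())

        # Read the staged rides from PostgreSQL over JDBC
        rides = (spark.read.format("jdbc")
                 .option("url", "jdbc:postgresql://localhost/escooter_staging")
                 .option("dbtable", "staging_rides")
                 .option("user", "airflow")
                 .option("password", "airflow")
                 .load())

        # Transform: remove duplicates and add a derived, useful field
        rides = (rides.dropDuplicates(["ride_id"])
                      .withColumn("duration_min",
                                  (F.unix_timestamp("end_time") - F.unix_timestamp("start_time")) / 60))

        # Load: bulk-index the processed rows into the Elasticsearch warehouse.
        # toLocalIterator() streams rows back to the driver, which is enough for a sketch.
        es = Elasticsearch("http://localhost:9200")
        actions = ({"_index": "escooter_rides", "_source": row.asDict()}
                   for row in rides.toLocalIterator())
        bulk(es, actions)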

Execution Steps

  1. Install Apache Airflow, PostgreSQL, pgAdmin 4, Elasticsearch, Kibana, Python 3, Pandas, PySpark, psycopg2

  2. Move the source code .py file into the ~/Airflow/dags directory

     $ mv EScooter-Data-Pipeline/Pipeline/e-scooterPipelineSpark.py ~/Airflow/dags
    
  3. Start the Elasticsearch database instance

     $ cd elasticsearch-7.16.2/
     
     $ bin/elasticsearch
    
  4. Start Kibana GUI instance

     $ kibana
    
  5. Open the Kibana GUI dashboard

     Open http://localhost:5601/ and navigate to Dashboard in the left panel
    
  6. Start Spark Master node instance

     $ cd ~/Spark/master-node/sbin/
     
     $ ./start-master.sh
    
  7. Start Spark Worker node instance

     $ cd ~/Spark/worker-node/sbin/
     $ ./start-worker.sh spark://Sankets-MacBook-Air.local:7077 -p 9911
    
  8. Run Apache Airflow webserver

     $ airflow webserver
    
  9. Run Apache Airflow scheduler

     $ airflow scheduler
    
  10. Open the Airflow GUI and start the pipeline

    Open http://localhost:8080/   - Master Node

    Open http://localhost:8081/   - Worker Node

    Turn on the toggle in front of 'e-scooterSparkPipeline' to start the workflow and monitor it; a verification sketch follows after these steps
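
Once a DAG run completes, a quick check from Python can confirm that documents reached the warehouse (the index name below is a placeholder):

    # Verify that the pipeline run landed documents in Elasticsearch.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    print(es.count(index="escooter_rides"))           # how many documents the index holds
    print(es.search(index="escooter_rides", size=1))  # peek at one indexed document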
    
