MichailParaskevopoulos / covid19-airtraffic-spark-bigquery

✈ A Spark-based ETL Pipeline for the OpenSky and OpenFlights Datasets

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

✈ A Spark-based ETL Pipeline for the OpenSky and OpenFlights Datasets

Motivation

The OpenSky dataset spans flights since January 2019 has been instrumental in analyzing the effect of COVID-19 on the aviation industry. This repository contains a PySpark pipeline to extract and transform the OpenSky dataset, and load it on to Bigquery as a fact table. PySpark pipelines are also included to process data from the OpenFlights dataset to construct dimensional tables with airline and aircraft data. Combined together in a single Data Warehouse, the OpenSky and OpenFlights datasets not only can be used to analyze trends in airtraffic during the COVID-19 pandemic, but also to explore patterns between different airlines and aircraft types.

Description of Files

  • envSetup.sh Shell script that executes datasetFetch.py, downloads and unzips the OpenSky and OpenFlights CSV files, and copies them to Cloud Storage.
  • datasetFetch.py Python script that uses BeautifulSoup to scrap the file paths for the OpenSky dataset (26 files as of March 2021) and dumps them into a JSON file
  • clusterSetup.sh Shell script that enables the required Google Cloud APIs, creates a Dataproc cluster, and submits the PySpark jobs
  • pyspark_airtraffic.py Main PySpark file for the ETL of the OpenSky dataset:
    • The OpenSky dataset files are loaded from the Cloud Storage into a single PySpark dataframe
    • Rows with null values are dropped
    • Date format is modified to be compatible with the BigQuery date data type
    • The month part of the date of each flight is copied to a new column to be used a partioning field
    • A regex expression is used to extract the airline ICAO identifier from the callsign of each flight
    • The origin and destination coordinates are used to calculate the approximate travel distance for each flight and categorize it either as a short-, medium-, or long-haul flight
    • Reduntant columns are dropped
    • The trasnformed dataframe is loaded into BigQuery as a partitioned table based on the month column to imrpove query performance
  • pyspark_airlines.py and pyspark_aircraft_types.py PySpark files for the ETL of dimensional tables for airlines and aircraft types

Data Sources

OpenSky:
Data from the OpenSky Network 2020 were used to construct the airtraffic and aircraft type tables.

Matthias Schäfer, Martin Strohmeier, Vincent Lenders, Ivan Martinovic and Matthias Wilhelm. "Bringing Up OpenSky: A Large-scale ADS-B Sensor Network for Research". In Proceedings of the 13th IEEE/ACM International Symposium on Information Processing in Sensor Networks (IPSN), pages 83-94, April 2014.

Xavier Olive. "traffic, a toolbox for processing and analysing air traffic data." Journal of Open Source Software 4(39), July 2019.

OpenFlights:
Data from OpenFlights were used to construct the airlines dimensional table.

About

✈ A Spark-based ETL Pipeline for the OpenSky and OpenFlights Datasets

License:Apache License 2.0


Languages

Language:Python 73.1%Language:Shell 26.9%