martinigoyanes / spark-streaming-flight-predictor

Real time flight cancellation predictor using Kafka, Zookeeper, Spark Streaming, MongoDB and HDFS

spark-streaming-flight-predictor

Real time flight delay predictor using Kafka, Zookeeper, Spark Streaming, MongoDB and HDFS

How to Run:

Google Kubernetes Engine (Kubernetes cluster with Public IP)

  1. Create a project in Google Cloud and open the Google Cloud Shell (https://cloud.google.com/kubernetes-engine/docs/deploy-app-cluster)
  2. Set the project in Cloud Shell:
gcloud config set project PROJECT_ID
  3. Create a GKE cluster with Autopilot for simplicity:
gcloud container clusters create-auto flight-delay-predictor-cluster --region=europe-west1
  4. Get authentication credentials for the cluster. This configures kubectl to use the cluster you created:
gcloud container clusters get-credentials flight-delay-predictor-cluster --region europe-west1
  5. Apply the Kubernetes configuration:
git clone https://github.com/martinigoyanes/spark-streaming-flight-predictor.git src
cd src && /bin/bash create-gke-cluster.sh
  6. Get the external IP of the service webapp-service and go to EXTERNAL_IP:5000/flights/delays/predict_kafka

Try my working version at http://34.27.214.100:5000/flights/delays/predict_kafka and click Submit.

Docker Compose

Launch with:

/bin/bash launch-docker-compose.sh

If the script does not open a browser when docker-compose has finished, go to http://localhost:5000/flights/delays/predict_kafka and click Submit.

To clean up and remove the services:

cd docker-compose/ && docker-compose down && cd ..

Minikube (Kubernetes on your machine/one node)

Launch with:

/bin/bash create-minikube-cluster.sh

If the script does not open a browser when minikube has finished, run

minikube service webapp-service

then go to WEBAPP_SERVICE-EXTERNAL_IP:30000/flights/delays/predict_kafka and click Submit.

Front End Architecture

This diagram shows how the front end architecture works in our flight delay prediction application. The user fills out a form on a web page with some basic information, which is submitted to the server. The server fills in some necessary fields derived from those in the form, such as "day of year", and emits a Kafka message containing a prediction request. Spark Streaming listens on a Kafka queue for these requests and makes the prediction, storing the result in MongoDB. Meanwhile, the client has received a UUID in the form's response and has been polling another endpoint every second. Once the data is available in Mongo, the client's next request picks it up. Finally, the client displays the result of the prediction to the user!
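The request and polling steps described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the field names (`FlightDate`, `DayOfYear`, `UUID`, and so on), the function names, and the `lookup` callable that stands in for the MongoDB query are all assumptions. The Kafka producer call itself is left out; the JSON string is simply what such a producer would send.

```python
import json
import time
import uuid
from datetime import date


def build_prediction_request(form):
    """Copy the form and add derived date fields plus a UUID (hypothetical field names)."""
    flight_date = date.fromisoformat(form["FlightDate"])
    request = dict(form)
    request["DayOfYear"] = flight_date.timetuple().tm_yday
    request["DayOfMonth"] = flight_date.day
    request["DayOfWeek"] = flight_date.isoweekday()  # Monday=1 ... Sunday=7
    request["UUID"] = str(uuid.uuid4())  # the client polls with this id
    return request


def poll_for_result(lookup, request_id, interval=1.0, max_attempts=30):
    """Call lookup(request_id) once per interval until it returns a result.

    lookup is an assumed stand-in for the database query; it returns None
    while the prediction is not stored yet.
    """
    for _ in range(max_attempts):
        result = lookup(request_id)
        if result is not None:
            return result
        time.sleep(interval)
    return None  # gave up after max_attempts


if __name__ == "__main__":
    form = {"Carrier": "AA", "Origin": "ATL", "Dest": "SFO",
            "FlightDate": "2016-12-25", "DepDelay": 5.0}
    request = build_prediction_request(form)
    message = json.dumps(request)  # payload a Kafka producer would send
    print(message)
```

Passing the lookup as a callable keeps the polling logic independent of the storage backend, so the same loop can be exercised against a plain dictionary (`store.get`) before it is wired to a real database.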

This setup is extremely fun to set up, operate and watch. Check out chapters 7 and 8 for more information!

Back End Architecture

The back end architecture diagram shows how we train a classifier model using historical data (all flights from 2015) on disk (HDFS or Amazon S3, etc.) to predict flight delays in batch in Spark. We save the model to disk when it is ready. Next, we launch Zookeeper and a Kafka queue. We use Spark Streaming to load the classifier model, and then listen for prediction requests in a Kafka queue. When a prediction request arrives, Spark Streaming makes the prediction, storing the result in MongoDB where the web application can pick it up.
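The serving half of this flow (load the model, take requests off the queue, predict, store the result where the web application can find it) can be sketched without any cluster. The code below is an assumption-laden stand-in, not the repository's Spark job: `queue.Queue` replaces the Kafka queue, a dictionary replaces the MongoDB collection, and `load_model` returns a toy rule instead of the trained classifier. The field names `UUID`, `DepDelay`, and `Prediction` are hypothetical.

```python
import queue


def load_model():
    """Stand-in for loading the saved classifier from disk."""
    def predict(request):
        # Toy rule, NOT the trained model: flag departures more than 15 minutes late.
        return 1 if request["DepDelay"] > 15 else 0
    return predict


def serve_predictions(requests, results, predict):
    """Drain the request queue, predict each request, and store it under its UUID.

    Returns the number of requests handled.
    """
    handled = 0
    while True:
        try:
            request = requests.get_nowait()
        except queue.Empty:
            return handled
        # Store the request together with its prediction, keyed by UUID,
        # so a client polling with that UUID can pick it up.
        results[request["UUID"]] = {**request, "Prediction": predict(request)}
        handled += 1


if __name__ == "__main__":
    requests = queue.Queue()  # stands in for the Kafka queue
    results = {}              # stands in for the MongoDB collection
    requests.put({"UUID": "a1", "DepDelay": 40.0})
    requests.put({"UUID": "b2", "DepDelay": 0.0})
    serve_predictions(requests, results, load_model())
    print(results["a1"]["Prediction"], results["b2"]["Prediction"])
```

In the real system the loop never ends, because Spark Streaming keeps listening for new messages; the sketch returns once the queue is empty so that it can be run and checked locally.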

This architecture is extremely powerful, and it is a huge benefit that we get to use the same code in batch and in realtime with PySpark Streaming.

Screenshots

Below are some examples of parts of the application we build in this book and in this repo. Check out the book for more!

Airline Entity Page

Each airline gets its own entity page, complete with a summary of its fleet and a description pulled from Wikipedia.

Airplane Fleet Page

We demonstrate summarizing an entity with an airplane fleet page which describes the entire fleet.

Flight Delay Prediction UI

We create an entire realtime predictive system with a web front-end to submit prediction requests.

Languages

Python 34.3%, HTML 24.3%, CSS 16.8%, JavaScript 9.9%, Scala 9.3%, Shell 3.0%, Dockerfile 2.5%