Streaming Data Pipeline and ML Model with Kafka and PyTorch (A CSE6250 Project)

Demo

In the video below (playing at 32x speed), you can see the output from the User Interface, which receives and displays in real time the following signals:

the raw streaming data,
the preprocessed streaming data,
the prediction from the ML model.

demo_x32speed.mp4

Here is the presentation for this project:

Authors

Summary

We reproduced a streaming data pipeline from the paper "Silent trial protocol for evaluation of real-time models at ICU". The pipeline uses deep learning to evaluate the risk of cardiac arrest from waveform data in real time. The streaming pipeline, orchestrated with Docker Compose, performs the following steps:

Creates streaming data of physiological signals from the MIMIC-III dataset.
Sends the raw streaming data using Kafka.
Preprocesses the raw data using PySpark.
Predicts whether the patient is at risk of cardiac arrest using PyTorch.
Stores the results in MySQL.
Displays the data and results in real time using Bokeh.
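
The way the stages above hand data to one another can be sketched in plain Python. This is only an illustrative stand-in, not the project's code: the real pipeline uses Kafka, PySpark, and PyTorch, while the function names, the sliding-window preprocessing, and the placeholder model below are all assumptions made for the example.

```python
from collections import deque
from statistics import mean
from typing import Callable, Iterable, Iterator, List


def produce(samples: Iterable[float]) -> Iterator[float]:
    """Stand-in for the producer stage: emit raw samples one at a time."""
    for sample in samples:
        yield sample


def preprocess(raw: Iterable[float], window: int) -> Iterator[List[float]]:
    """Stand-in for the preprocessing stage.

    Assumption: preprocessing groups samples into sliding windows of a
    fixed size. Nothing is emitted until the first window is full.
    """
    buffer: deque = deque(maxlen=window)
    for sample in raw:
        buffer.append(sample)
        if len(buffer) == window:
            yield list(buffer)


def predict(windows: Iterable[List[float]],
            model: Callable[[List[float]], float]) -> Iterator[float]:
    """Stand-in for the prediction stage: score each window with `model`."""
    for w in windows:
        yield model(w)


# Hypothetical usage: the mean of each window plays the role of the model.
raw_stream = produce([1.0, 2.0, 3.0, 4.0])
scores = list(predict(preprocess(raw_stream, window=3), model=mean))
```

Because each stage is a generator, a sample flows through the chain as soon as it arrives, and downstream stages stay silent until enough upstream data has accumulated.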

Instructions for running

Clone this repository to your local machine.
Install Docker locally. Then pull the Docker image btctothemoonz/travisnkento:008 with the command docker pull btctothemoonz/travisnkento:008.
Make sure you pull all data, including the data stored in the data directory. In particular, check that the following directories exist:

data/waveform/physionet.org/p00/p000194
data/waveform/physionet.org/p04/p004980
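
A quick way to confirm that both directories are in place before starting the containers is a small check like the one below. It is a minimal sketch: the helper name missing_data_dirs is hypothetical, and it assumes the script is run from (or pointed at) the directory that contains the data folder.

```python
from pathlib import Path
from typing import List

# The two waveform directories named in the instructions.
REQUIRED_DIRS = (
    "data/waveform/physionet.org/p00/p000194",
    "data/waveform/physionet.org/p04/p004980",
)


def missing_data_dirs(repo_root: str = ".") -> List[str]:
    """Return the required directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


# Hypothetical usage: an empty list means both directories were found.
missing = missing_data_dirs(".")
if missing:
    print("Missing data directories:", ", ".join(missing))
```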

On a command line prompt, use cd to navigate to the repository's root directory.
Then, run docker compose up. This command starts Kafka, Zookeeper, MySQL, the preprocessing container, the graph plotting container and the deeplearning container.
Monitor the Docker Compose logs. You should see a line that repeats every second: “Countdown 60 seconds: Please manually start a browser http://localhost:5068/ else the server will crash.” Open a browser and navigate to http://localhost:5068/ before the countdown ends. DO NOT refresh the page once it has loaded. It can take up to three minutes before the first data point shows up. You should then see the log “Thank you for going to http://localhost:5068/!”. After that, you can sit back and watch the raw data, processed data, and predictions appear at http://localhost:5068/.

Initially, only raw data will appear.
Starting at the 3rd minute, processed data will appear.
Starting at the 10th minute, predictions will appear.

You can now wait and see the results being streamed.
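
If you would rather not watch the logs for the countdown, a small helper can wait until something is listening on port 5068 and only then open the page. This is a sketch under assumptions: wait_for_port is a hypothetical helper, and whether the server treats a scripted browser launch (or the brief probe connection) the same as a manually opened browser has not been verified here.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Retries every `interval` seconds and returns False if `timeout`
    seconds pass without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Hypothetical usage (left commented out so nothing opens on import):
# import webbrowser
# if wait_for_port("localhost", 5068):
#     webbrowser.open("http://localhost:5068/")
```

Remember that the page should still not be refreshed after it loads.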

MIT License
