The goal of this project is to build a data pipeline that creates a labeled dataset, train a machine learning model pipeline on that dataset, and deploy a Flask app on a Kubernetes cluster. Everything is managed in the cloud.
This project has 4 stages:
- Annotation Pipeline
  - This is the starting point for the main pipeline.
  - It generates a labeled dataset using the Azure Text Analytics API.
  - The entire dataset is stored in an AWS S3 bucket.
- Machine Learning Pipeline
  - This is the second pipeline.
  - The dataset created in the Annotation Pipeline is used to train our model.
  - The trained model is stored in an S3 bucket.
- REST Flask App
  - The trained model is wrapped in a Python Flask REST app.
  - The Flask app is tested inside a Docker container.
  - The Docker container is deployed on Google Kubernetes Engine.
- Inference Pipeline
  - The Inference Pipeline is an automated sentiment analysis pipeline.
  - It scrapes EDGAR earnings call transcript data and stores it in the cloud.
  - Using the Flask web app from Stage 3, it predicts the sentiment of each document.
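Stage 3 wraps the trained model in a Flask REST endpoint. A minimal sketch of what that app might look like — the `/predict` route name, the JSON payload shape, and the trivial keyword scorer standing in for the real TensorFlow model are illustrative assumptions, not the project's actual code:

```python
# Sketch of the Stage 3 REST app. The /predict route, payload shape, and the
# keyword-based placeholder (in place of the trained model loaded from S3)
# are assumptions for illustration only.
from flask import Flask, jsonify, request

app = Flask(__name__)

NEGATIVE_WORDS = {"loss", "decline", "risk"}  # placeholder for real inference


def score_sentiment(text: str) -> str:
    """Stand-in for model inference; the real app would use the trained model."""
    words = set(text.lower().split())
    return "negative" if words & NEGATIVE_WORDS else "positive"


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"text": "..."} and return the predicted label.
    text = request.get_json(force=True).get("text", "")
    return jsonify({"sentiment": score_sentiment(text)})


# app.run(host="0.0.0.0", port=5000)  # uncomment to serve locally
```

In the actual pipeline, `docker build` packages an app like this together with the model artifact so it can be deployed to the cluster.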
These instructions will get a copy of the project up and running in your local environment using cloud infrastructure.
git clone https://github.com/Dhruv-Panchal/ml-as-a-service-pipeline
- Python 3.7
- AWS account
- GCP account
- Microsoft Azure account
Install the required Python packages:
pip3 install -r requirements.txt
- Create multiple AWS S3 buckets.
- Configure an IAM role with full S3 bucket access in your local environment. Learn more here.
- Create a GCP account. Get started here.
- Create an Azure account. Get started here.
- Request a Metaflow sandbox to run your pipeline on AWS Batch.
- Once everything is set up, configure Metaflow's sandbox by running
metaflow configure sandbox
on the CLI and entering the API keys from Step 1.
- Configure the input/output buckets on AWS S3 and enter the bucket names in the Annotation Pipeline, ML Pipeline, Inference Pipeline, and Flask App.
- Lastly, add the Azure API keys here.
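The annotation stage sends each document to the Azure Text Analytics sentiment endpoint and keeps the highest-confidence class as the label. A hedged sketch of that label-selection logic — the function name and row shape are assumptions, and the actual Azure call is left commented out so the snippet runs without credentials:

```python
# Sketch of the Stage 1 labeling step. The to_label helper and the row shape
# are illustrative; the score dict mirrors the confidence_scores fields
# returned by the azure-ai-textanalytics SDK.

def to_label(confidence_scores: dict) -> str:
    """Pick the sentiment class with the highest confidence score."""
    return max(confidence_scores, key=confidence_scores.get)

# With real credentials, the scores would come from the API, roughly:
#   from azure.core.credentials import AzureKeyCredential
#   from azure.ai.textanalytics import TextAnalyticsClient
#   client = TextAnalyticsClient(endpoint=ENDPOINT, credential=AzureKeyCredential(KEY))
#   doc = client.analyze_sentiment(["Revenue grew 20% this quarter."])[0]
#   scores = {"positive": doc.confidence_scores.positive,
#             "neutral": doc.confidence_scores.neutral,
#             "negative": doc.confidence_scores.negative}

labeled_row = {
    "text": "Revenue grew 20% this quarter.",
    "label": to_label({"positive": 0.92, "neutral": 0.06, "negative": 0.02}),
}
```

Rows like this, accumulated over the whole corpus, form the labeled dataset that the pipeline writes to the S3 bucket.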
Run on the CLI:
- Change the permissions of the files
chmod a+x Annotation\ Pipeline/index.py ML\ Pipeline/index.py Inference\ Pipeline/index.py
- Running the Annotation Pipeline
./Annotation\ Pipeline/index.py run --with sandbox
- Running the Machine Learning Pipeline
./ML\ Pipeline/index.py run --with sandbox
- Creating a Docker container of the Flask app
cd REST\ Flask\ App/
docker build -t yourhubusername/reponame .
docker login --username=yourhubusername
docker push yourhubusername/reponame
Once the Dockerized Flask app has been pushed to the repository in Step 3,
create a Kubernetes cluster on Google Cloud Platform and deploy your Docker image from Docker Hub. Learn more here.
Now your Flask app is up and accessible from anywhere in the world!
- Add the required Tickerfile bucket location in the Inference Pipeline.
- Add the IP address and port number obtained from the GCP Kubernetes cluster in the Inference Pipeline.
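Once the cluster IP and port are configured, the Inference Pipeline can call the deployed endpoint for each scraped transcript. A minimal client sketch using only the standard library — the placeholder URL, `/predict` route, and JSON payload shape are assumptions based on a typical Flask deployment, not the project's confirmed API:

```python
# Sketch of how the Inference Pipeline might call the deployed Flask app.
# The service URL, route, and payload shape are illustrative assumptions.
import json
import urllib.request

# Replace with the external IP address and port reported by the GCP cluster.
SERVICE_URL = "http://<cluster-ip>:<port>/predict"  # hypothetical placeholder


def build_payload(transcript_text: str) -> bytes:
    """Encode a transcript as the JSON body the Flask app would expect."""
    return json.dumps({"text": transcript_text}).encode("utf-8")


def predict_sentiment(transcript_text: str, url: str = SERVICE_URL) -> str:
    """POST one transcript to the service and return the predicted label."""
    req = urllib.request.Request(
        url,
        data=build_payload(transcript_text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["sentiment"]
```

The pipeline would loop over the scraped EDGAR transcripts stored in the cloud, calling `predict_sentiment` on each one.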
- Metaflow - Data Pipeline Framework
- TensorFlow - Machine Learning Model
- Docker - Container Environment
- Flask - Web Framework
- AWS Batch - Cloud Infrastructure for Big Data Pipeline
- Azure Text Analytics API - NLP Text Analytics API
- Google Kubernetes Engine - Managed Kubernetes Cluster
- Dhruv Panchal - Research and Development - LinkedIn
- Kashish Shah - Design, Architecture and Deployment - LinkedIn
- Manogana Mantripragada - Machine Learning Engineer - LinkedIn
This project is licensed under the Commons Clause License - see the LICENSE.md file for details