annie0sc / big-data-covid-vaccine

The Covid vaccine, as of February 2021, has received a mixed response from the general population. We are therefore using Twitter data to analyze how much tweets are influencing public opinion.


Big Data - Covid Vaccine

Objective

Covid-19 is currently the dominant research topic. We have decided to analyze patient recovery rates based on age, gender, and other criteria. To achieve this, we will first use a dataset from Kaggle to process static data, then move to live streaming data from Twitter. Apache Hadoop will serve as the file system, Apache Flink will handle the live Twitter stream, and Python will tie the whole project together, hence the name: PyFlink-Covid-Vaccine.

Meet the Team


Swaroop Reddy


Annie Chandolu


Alekhya Jaddu


Tejaswi Reddy Kandula


Naga Anshitha Velagapudi


Harika Kulkarni

Datasets Used

Static Dataset:

Streaming Data:

  • We plan to use live streaming data (tweets) from Twitter as a future improvement to the project.

Tech Stack

Tasks/Issues

  • Swaroop Reddy - Will work on the HDFS (Hadoop) MapReduce programming model.
  • Annie Samarpitha - Will work with Alekhya on Python programming.
  • Alekhya Jaddu - Will work on the programming side using Python scripts.
  • Tejaswi Reddy Kandula - Will work on shell scripting.
  • Naga Anshitha Velagapudi - Will work on Flink, which is used to process large data streams.
  • Harika Kulkarni - Will work on Flink.

SubTopics:

  1. Swaroop Reddy Gottigundala - Writing a Flink Python DataStream API program.
  2. Annie Samarpitha Chandolu - Analysis of weekly case and weekly death counts.
  3. Alekhya Jaddu - Word count using PyFlink.
  4. Tejaswi Reddy Kandula - Word count of all Covid cases using PyFlink on the covid_19_clean_complete dataset.
  5. Naga Anshitha Velagapudi - Analyzing the number of times/days each country administered vaccinations.
  6. Harika Kulkarni - Country-wise highest recovery rates versus death rates.

Prerequisites

  • Apache Flink
  • Pip
  • Python (3.6.0 to 3.8.0)

Description

Apache Flink

  • Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, and to perform computations at in-memory speed and at any scale.
  • Flink also provides batch processing, graph processing, and iterative processing for machine learning applications.
  • Flink is considered a next-generation stream processing system.
  • Flink offers substantially higher processing speeds than Spark and Hadoop.
  • Flink provides low latency and high throughput.
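The core idea behind Flink's streaming model, processing records one at a time as they arrive instead of in a single batch, can be illustrated in plain Python with a generator (no Flink required; the sample tweets and keyword below are made-up examples, not project data):

```python
def keyword_counts(stream, keyword):
    """Process one record at a time from a (possibly unbounded) stream,
    yielding a running count of records that mention the keyword."""
    count = 0
    for record in stream:
        if keyword in record.lower():
            count += 1
        yield count

# A tiny bounded "stream" standing in for live Twitter data.
tweets = ["Got my covid vaccine today!", "Nice weather", "Vaccine rollout is slow"]
print(list(keyword_counts(tweets, "vaccine")))  # [1, 1, 2]
```

Because the generator yields after every record, the count is always up to date even if the stream never ends, which is the property that lets Flink produce results with low latency.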

ALEKHYA JADDU

Sub-Topic: WordCount using pyFlink

Prerequisites:

  • Apache Flink
  • PIP
  • Python (3.6.0 to 3.8.0)

Installation of Python

If any other versions of python are previously installed in your system use the below command to uninstall

choco uninstall python

To install python of a specific version use the below command

choco install python --version=3.8.0

Installation steps for PyFlink

PyFlink requires Python 3.5, 3.6, 3.7, or 3.8. Run the following command to make sure your installation meets the requirement:

$ python --version

Use the below command to install apache-flink

$ python -m pip install apache-flink 

You can also build PyFlink from source by following the development guide.

Note: Starting from Flink 1.11, PyFlink jobs can also run locally on Windows, so you can develop and debug PyFlink jobs on Windows.

Code for Word Count using PyFlink:

wordcount.py
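The counting logic inside `wordcount.py` boils down to splitting lines into words and summing per word; a plain-Python sketch of that logic (the sample lines are illustrative, not rows from `full_grouped.csv`):

```python
from collections import Counter

def word_count(lines):
    """Count word occurrences across lines, mirroring what the PyFlink
    job does with a split/flat-map followed by a keyed sum."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

sample = ["covid vaccine news", "vaccine rollout continues"]
print(word_count(sample)["vaccine"])  # 2
```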

Input file

full_grouped.csv

Output file

output.csv

Demonstration video

https://app.vidgrid.com/view/QTuVfghYRV38

Annie Chandolu

Subtopic: Analysis of weekly case and weekly death counts

I am doing an analysis on a Covid dataset which is stored in the following repository:

https://github.com/annie0sc/practice-flink-wordcount

A Preview of my work: VIDEO

https://use.vg/hlz7LW
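The weekly aggregation behind this analysis can be sketched in plain Python: group daily rows into ISO-week buckets and sum them. The `Date` and `New cases` column names are assumptions about the dataset, not confirmed by the repository:

```python
import csv
import datetime
from collections import defaultdict
from io import StringIO

def weekly_totals(csv_text, date_col="Date", cases_col="New cases"):
    """Sum daily counts into (year, ISO week) buckets. Column names are
    assumptions about the dataset layout."""
    totals = defaultdict(int)
    for row in csv.DictReader(StringIO(csv_text)):
        year, week, _ = datetime.date.fromisoformat(row[date_col]).isocalendar()
        totals[(year, week)] += int(row[cases_col])
    return dict(totals)

sample = "Date,New cases\n2021-02-01,100\n2021-02-02,150\n2021-02-08,200\n"
print(weekly_totals(sample))  # {(2021, 5): 250, (2021, 6): 200}
```

The same function works for weekly death counts by passing the deaths column name as `cases_col`.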

Harika Kulkarni

Subtopic: Country-wise highest recovery rates versus death rates

I am working on comparing the highest recovery rates versus death rates by country using Covid data.

Prerequisites:

  • Python
  • Flink
  • pip

Google Colab:

Colab is a Python development environment that runs in the browser using Google Cloud. We can do the following with Google Colab:

  • Write and execute code in Python
  • Document your code with support for mathematical equations
  • Create/Upload/Share notebooks
  • Import/Save notebooks from/to Google Drive
  • Import/Publish notebooks from GitHub
  • Import external datasets e.g. from Kaggle
  • Integrate PyTorch, TensorFlow, Keras, OpenCV
  • Free Cloud service with free GPU

Steps for using Google Colab:

Input File: Link to input file

  • Step 1: As Colab implicitly uses Google Drive for storing your notebooks, ensure that you are logged in to your Google Drive account before proceeding further.

  • Step 2: Open the following URL in your browser: https://colab.research.google.com Your browser will display the Colab start screen (assuming you are logged in to Google Drive).

  • Step 3: Click the NEW NOTEBOOK link at the bottom of the screen. A new notebook will open.

  • Step 4: Enter a trivial Python program in the code window and execute it. Type the following two Python statements in the code window and click the arrow on its left side.

  • Step 5: Install apache-flink and all necessary packages using the command below.

  • Step 6: Read the data from the input file.

  • Step 7: Compare TotalDeaths and TotalRecovered from the Covid data.
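Steps 6 and 7 can be sketched in plain Python before porting to Flink. The `Country`, `TotalConfirmed`, `TotalRecovered`, and `TotalDeaths` column names follow the step descriptions above and are assumptions about the actual file:

```python
import csv
from io import StringIO

def recovery_vs_death(csv_text):
    """Per-country recovery and death rates, sorted so the highest
    recovery rate comes first. Column names are assumptions."""
    rates = []
    for row in csv.DictReader(StringIO(csv_text)):
        confirmed = int(row["TotalConfirmed"])
        if confirmed == 0:
            continue  # skip countries with no cases to avoid dividing by zero
        rates.append((row["Country"],
                      int(row["TotalRecovered"]) / confirmed,
                      int(row["TotalDeaths"]) / confirmed))
    return sorted(rates, key=lambda r: r[1], reverse=True)

sample = ("Country,TotalConfirmed,TotalRecovered,TotalDeaths\n"
          "A,100,90,5\n"
          "B,200,100,20\n")
print(recovery_vs_death(sample)[0])  # ('A', 0.9, 0.05)
```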

Output File: Link to output file

References: https://flink.apache.org/flink-architecture.html

My Demonstration Video Link:

Video Link

Naga Anshitha Velagapudi

Subtopic: Analyzing the number of times/days a country administered vaccinations

I'm working on analyzing the number of times/days each country administered vaccinations.
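The core of this analysis is counting distinct reporting days per country; a plain-Python sketch (the `country` and `date` column names are assumptions about the Kaggle dataset):

```python
import csv
from collections import defaultdict
from io import StringIO

def vaccination_days(csv_text):
    """Count the number of distinct days on which each country reported
    vaccinations. Column names are assumptions about the dataset."""
    days = defaultdict(set)
    for row in csv.DictReader(StringIO(csv_text)):
        days[row["country"]].add(row["date"])
    return {country: len(dates) for country, dates in days.items()}

sample = ("country,date\n"
          "India,2021-01-16\n"
          "India,2021-01-17\n"
          "Norway,2021-01-16\n")
print(vaccination_days(sample))  # {'India': 2, 'Norway': 1}
```

Using a set of dates per country makes the count robust to duplicate rows for the same day.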

Demonstration Video

Click on "view raw" to view/access the video.

My Repo:

Link

Dataset:

Link

Input:

Link

Output:

Link

References:

  1. Description
  2. Kaggle_Dataset
  3. Code_Reference

Swaroop Reddy

Subtopic: Writing a Flink Python DataStream API program.

Prerequisites:

  • Python
  • Flink
  • pip

Execution Steps

Install PyFlink:

```
$ python -m pip install apache-flink
```

The program uses the following packages:

```python
from pyflink.common.serialization import SimpleStringEncoder
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import StreamingFileSink
```

First, make sure the output directory does not already exist:

```
$ rm -rf /tmp/output
```

Then run the example you just created on the command line:

```
$ python datastream_tutorial.py
```

Finally, you can see the result in the /tmp/output folder:

```
$ find /tmp/output -type f -exec cat {} \;
1,aaa
2,bbb
```
My repo :

My Demo Video Link:

References :
