annie0sc / big-data-covid-vaccine

The Covid vaccine, as of February 2021, has received a mixed response from the general population. We are therefore using Twitter data to analyze how much tweets are influencing public opinion.


Big Data - Covid Vaccine

Objective

Covid-19 is currently the dominant research topic. We have decided to analyze patient recovery rates based on age, gender, and other criteria. To achieve this, we will first use a dataset from Kaggle to process static data, then move to live streaming data from Twitter. Apache Hadoop will serve as the file system, Apache Flink will handle the live Twitter stream, and Python will tie the whole project together, hence the name: PyFlink-Covid-Vaccine.

Meet the Team


Swaroop Reddy


Annie Chandolu


Alekhya Jaddu


Tejaswi Reddy Kandula


Naga Anshitha Velagapudi


Harika Kulkarni

Datasets Used

Static Dataset:

Streaming Data:

  • We plan to use live streaming data (tweets) from Twitter as a future improvement to the project.

Tech Stack

Tasks/Issues

  • Swaroop Reddy - Will work on the HDFS (Hadoop) MapReduce programming model.
  • Annie Samarpitha - Will work with Alekhya on Python programming.
  • Alekhya Jaddu - Will work on the programming side using Python scripts.
  • Tejaswi Reddy Kandula - Will work on shell scripting.
  • Naga Anshitha Velagapudi - Will work on Flink, which is used to process large data streams.
  • Harika Kulkarni - Will work on Flink.

SubTopics:

  1. Swaroop Reddy Gottigundala - Writing a Flink Python DataStream API program.
  2. Annie Samarpitha Chandolu - Analysis of weekly case and weekly death counts.
  3. Alekhya Jaddu - Word count using PyFlink.
  4. Tejaswi Reddy Kandula - Word count of all Covid cases using PyFlink on the covid_19_clean_complete dataset.
  5. Naga Anshitha Velagapudi - Analyzing the number of times/days each country administered vaccinations.
  6. Harika Kulkarni - Country-wise highest recovery rates versus death rates.

Prerequisites

  • Apache Flink
  • Pip
  • Python (3.6.0 to 3.8.0)

Description

Apache Flink

  • Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, and to perform computations at in-memory speed and at any scale.
  • Flink also provides batch processing, graph processing, and iterative processing for machine learning applications.
  • Flink is considered a next-generation stream processing system.
  • Flink offers substantially higher processing speeds than Spark and Hadoop.
  • Flink provides low latency and high throughput.
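The core idea behind Flink's streaming model, processing records one at a time as they arrive instead of in a single batch, can be illustrated in plain Python with a generator (no Flink required; the sample tweets and keyword below are made-up examples, not project data):

```python
def keyword_counts(stream, keyword):
    """Process one record at a time from a (possibly unbounded) stream,
    yielding a running count of records that mention the keyword."""
    count = 0
    for record in stream:
        if keyword in record.lower():
            count += 1
        yield count

# A tiny bounded "stream" standing in for live Twitter data.
tweets = ["Got my covid vaccine today!", "Nice weather", "Vaccine rollout is slow"]
print(list(keyword_counts(tweets, "vaccine")))  # [1, 1, 2]
```

Because the generator yields after every record, the count is always up to date even if the stream never ends, which is the property that lets Flink produce results with low latency.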

ALEKHYA JADDU

Sub-Topic: WordCount using pyFlink

Prerequisites:

  • Apache Flink
  • PIP
  • Python (3.6.0 to 3.8.0)

Installation of Python

If any other versions of python are previously installed in your system use the below command to uninstall

choco uninstall python

To install python of a specific version use the below command

choco install python --version=3.8.0

Installation steps for PyFlink

PyFlink requires Python 3.5, 3.6, 3.7, or 3.8. Run the following command to make sure your installation meets the requirement:

$ python --version

Use the below command to install apache-flink

$ python -m pip install apache-flink 

You can also build PyFlink from source by following the development guide.

Note: Starting from Flink 1.11, PyFlink jobs can also run locally on Windows, so you can develop and debug PyFlink jobs on Windows.

Code for Word Count using PyFlink:

wordcount.py
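The counting logic inside `wordcount.py` boils down to splitting lines into words and summing per word; a plain-Python sketch of that logic (the sample lines are illustrative, not rows from `full_grouped.csv`):

```python
from collections import Counter

def word_count(lines):
    """Count word occurrences across lines, mirroring what the PyFlink
    job does with a split/flat-map followed by a keyed sum."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

sample = ["covid vaccine news", "vaccine rollout continues"]
print(word_count(sample)["vaccine"])  # 2
```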

Input file

full_grouped.csv

Output file

output.csv

Demonstration video

https://app.vidgrid.com/view/QTuVfghYRV38

Annie Chandolu

Subtopic: Analysis of weekly case and weekly death counts

I am doing an analysis on a Covid dataset which is stored in the following repository:

https://github.com/annie0sc/practice-flink-wordcount

A Preview of my work: VIDEO

https://use.vg/hlz7LW
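The weekly aggregation behind this analysis can be sketched in plain Python: group daily rows into ISO-week buckets and sum them. The `Date` and `New cases` column names are assumptions about the dataset, not confirmed by the repository:

```python
import csv
import datetime
from collections import defaultdict
from io import StringIO

def weekly_totals(csv_text, date_col="Date", cases_col="New cases"):
    """Sum daily counts into (year, ISO week) buckets. Column names are
    assumptions about the dataset layout."""
    totals = defaultdict(int)
    for row in csv.DictReader(StringIO(csv_text)):
        year, week, _ = datetime.date.fromisoformat(row[date_col]).isocalendar()
        totals[(year, week)] += int(row[cases_col])
    return dict(totals)

sample = "Date,New cases\n2021-02-01,100\n2021-02-02,150\n2021-02-08,200\n"
print(weekly_totals(sample))  # {(2021, 5): 250, (2021, 6): 200}
```

The same function works for weekly death counts by passing the deaths column name as `cases_col`.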

Harika Kulkarni

Subtopic: Country-wise highest recovery rates versus death rates

I am working on comparing the highest recovery rates versus death rates by country using Covid data.

Prerequisites:

  • Python
  • Flink
  • pip

Google Colab:

Colab is a Python development environment that runs in the browser using Google Cloud. We can do the following with Google Colab:

  • Write and execute code in Python
  • Document your code with support for mathematical equations
  • Create/Upload/Share notebooks
  • Import/Save notebooks from/to Google Drive
  • Import/Publish notebooks from GitHub
  • Import external datasets e.g. from Kaggle
  • Integrate PyTorch, TensorFlow, Keras, OpenCV
  • Free Cloud service with free GPU

Steps for using Google Colab:

Input File: Link to input file

  • Step 1: As Colab implicitly uses Google Drive for storing your notebooks, ensure that you are logged in to your Google Drive account before proceeding further.

  • Step 2: Open the following URL in your browser: https://colab.research.google.com Your browser will display the Colab start screen (assuming you are logged in to Google Drive).

  • Step 3: Click the NEW NOTEBOOK link at the bottom of the screen. A new notebook will open.

  • Step 4: Enter a trivial Python program in the code window and execute it. Type the following two Python statements in the code window and click the arrow on its left side.

  • Step 5: Install apache-flink and all necessary packages using the command below.

  • Step 6: Read the data from the input file.

  • Step 7: Compare TotalDeaths and TotalRecovered from the Covid data.
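Steps 6 and 7 can be sketched in plain Python before porting to Flink. The `Country`, `TotalConfirmed`, `TotalRecovered`, and `TotalDeaths` column names follow the step descriptions above and are assumptions about the actual file:

```python
import csv
from io import StringIO

def recovery_vs_death(csv_text):
    """Per-country recovery and death rates, sorted so the highest
    recovery rate comes first. Column names are assumptions."""
    rates = []
    for row in csv.DictReader(StringIO(csv_text)):
        confirmed = int(row["TotalConfirmed"])
        if confirmed == 0:
            continue  # skip countries with no cases to avoid dividing by zero
        rates.append((row["Country"],
                      int(row["TotalRecovered"]) / confirmed,
                      int(row["TotalDeaths"]) / confirmed))
    return sorted(rates, key=lambda r: r[1], reverse=True)

sample = ("Country,TotalConfirmed,TotalRecovered,TotalDeaths\n"
          "A,100,90,5\n"
          "B,200,100,20\n")
print(recovery_vs_death(sample)[0])  # ('A', 0.9, 0.05)
```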

Output File: Link to output file

References: https://flink.apache.org/flink-architecture.html

My Demonstration Video Link:

Video Link

Naga Anshitha Velagapudi

Subtopic: Analyzing the number of times/days a country administered vaccinations

I'm working on analyzing the number of times/days each country administered vaccinations.
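The core of this analysis is counting distinct reporting days per country; a plain-Python sketch (the `country` and `date` column names are assumptions about the Kaggle dataset):

```python
import csv
from collections import defaultdict
from io import StringIO

def vaccination_days(csv_text):
    """Count the number of distinct days on which each country reported
    vaccinations. Column names are assumptions about the dataset."""
    days = defaultdict(set)
    for row in csv.DictReader(StringIO(csv_text)):
        days[row["country"]].add(row["date"])
    return {country: len(dates) for country, dates in days.items()}

sample = ("country,date\n"
          "India,2021-01-16\n"
          "India,2021-01-17\n"
          "Norway,2021-01-16\n")
print(vaccination_days(sample))  # {'India': 2, 'Norway': 1}
```

Using a set of dates per country makes the count robust to duplicate rows for the same day.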

Demonstration Video

Click on "view raw" to view/access the video.

My Repo:

Link

Dataset:

Link

Input:

Link

Output:

Link

References:

  1. Description
  2. Kaggle_Dataset
  3. Code_Reference

Swaroop Reddy

Subtopic: Writing a Flink Python DataStream API program.

Prerequisites:

  • Python
  • Flink
  • pip

Execution Steps

Install PyFlink:

```
$ python -m pip install apache-flink
```

The program uses the following packages:

```python
from pyflink.common.serialization import SimpleStringEncoder
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import StreamingFileSink
```

First, make sure the output directory does not already exist:

```
$ rm -rf /tmp/output
```

Then run the example you just created on the command line:

```
$ python datastream_tutorial.py
```

Finally, you can see the result in the /tmp/output folder:

```
$ find /tmp/output -type f -exec cat {} \;
1,aaa
2,bbb
```
My repo :

My Demo Video Link:

References :
