Ahmedessamg / Sprints_Covid_graduation_project

🎯 Purpose

This repo contains the final graduation project (COVID dataset) for the Sprints Big Data MasterClass.

Masterclass link:

https://sprints.ai/course/Big-Data-MasterClass-1345046

🍀 Sponsors

This project exists thanks to **Eng. Ahmed Reda** and **Eng. Amr Saleh**.

Main Business Requirement

This project aims to create an automated pipeline workflow, from ingestion through visualization, for the COVID dataset. The resulting report should:

  1. Show on a map the top 10 countries by death rate
  2. Show on a map the top 10 countries by testing rate
  3. Show the top 10 countries by testing rate on a pie chart
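The three rankings above all reduce to the same kind of query: sort by a rate column and keep the first ten rows. A minimal sketch of generating such queries from a shell script follows. The table `covid_db.covid_final` and the columns `country`, `death_rate`, and `testing_rate` are hypothetical names chosen for illustration and are not taken from the repository.

```shell
#!/bin/sh
# Sketch: emit a HiveQL "top 10 by rate" query for a given rate column.
# covid_db.covid_final, country, death_rate and testing_rate are
# hypothetical names; adjust them to the real final table.
top10_query() {
  rate_col="$1"
  cat <<EOF
SELECT country, ${rate_col}
FROM covid_db.covid_final
WHERE ${rate_col} IS NOT NULL
ORDER BY ${rate_col} DESC
LIMIT 10;
EOF
}

# Write one query file per ranking.
top10_query death_rate   > top10_death_rate.hql
top10_query testing_rate > top10_testing_rate.hql
```

On a machine with Hive installed, each generated file could then be run with `hive -f top10_death_rate.hql`, and the result exported for the map and pie-chart visuals.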

Technical Requirements

  1. Create a folder on the Cloudera virtual machine (VM)

  2. Upload the dataset “covid-19.csv” to the VM using WinSCP

  3. Load the dataset into an HDFS directory using HDFS CLI commands in a shell script

  4. Create a database in Hive and a schema for each Hive loading stage

    I. 1st: a Hive staging table that points to the dataset location, to select data from

    II. 2nd: a Hive ORC table partitioned by country, loaded with dynamic partitioning to speed up queries

    III. 3rd: a final Hive table that produces the final report, written to an output file for visualization

  5. Create Oozie workflow actions to run the HDFS shell script and execute the Hive queries
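Steps 3, 4.I, and 4.II above might look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's actual scripts: the local and HDFS paths, the database name `covid_db`, the table names, and the four columns are all hypothetical, and the CSV is assumed to be comma-delimited with a single header row. When the `hdfs` or `hive` CLI is not installed, the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of steps 3 and 4 (I, II). Paths, database, table and column
# names are hypothetical placeholders, not taken from the repository.
LOCAL_DIR="${LOCAL_DIR:-/home/cloudera/covid_project}"
HDFS_DIR="${HDFS_DIR:-/user/cloudera/covid/staging}"
DATASET="covid-19.csv"
DDL_FILE="${DDL_FILE:-covid_tables.hql}"

# Fall back to a dry run (print only) when hdfs or hive is missing.
if [ -z "${DRY_RUN:-}" ]; then
  if command -v hdfs >/dev/null 2>&1 && command -v hive >/dev/null 2>&1; then
    DRY_RUN=0
  else
    DRY_RUN=1
  fi
fi

# Print a command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Step 3: copy the dataset from the VM's local disk into HDFS.
load_to_hdfs() {
  run hdfs dfs -mkdir -p "$HDFS_DIR" &&
  run hdfs dfs -put -f "$LOCAL_DIR/$DATASET" "$HDFS_DIR/"
}

# Step 4: write the staging and ORC table definitions to a HiveQL file.
write_hive_ddl() {
  cat > "$DDL_FILE" <<EOF
CREATE DATABASE IF NOT EXISTS covid_db;

-- Stage I: external table over the raw CSV in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_staging (
  country STRING,
  total_cases BIGINT,
  total_deaths BIGINT,
  total_tests BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '${HDFS_DIR}'
TBLPROPERTIES ('skip.header.line.count'='1');

-- Stage II: ORC table partitioned by country.
CREATE TABLE IF NOT EXISTS covid_db.covid_orc (
  total_cases BIGINT,
  total_deaths BIGINT,
  total_tests BIGINT
)
PARTITIONED BY (country STRING)
STORED AS ORC;

-- Dynamic partitioning: the partition column comes last in the SELECT.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE covid_db.covid_orc PARTITION (country)
SELECT total_cases, total_deaths, total_tests, country
FROM covid_db.covid_staging;
EOF
}

load_to_hdfs
write_hive_ddl
run hive -f "$DDL_FILE"
```

Setting `DRY_RUN=1` before running the script prints the commands without touching HDFS or Hive, which is a convenient way to check paths first. The final report table (stage III) and the Oozie workflow definition are left out of this sketch.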

Power BI dashboard

https://app.powerbi.com/view?r=eyJrIjoiYjFiMDNkYjYtZTRkZC00MWVmLWIyOWYtNWY1M2U1MGQzMTcwIiwidCI6ImM0ZjhmOTIyLWIzZWYtNGI1OS04Y2ExLTkzZjdhYjc2N2NjZSJ9

Languages

HiveQL 89.4%, Shell 10.6%